Chapter 1


Introduction 


The ease with which we recognize a face, understand spoken words, read handwritten
characters, identify our car keys in our pocket by feel, and decide whether
an apple is ripe by its smell belies the astoundingly complex processes that underlie 
these acts of pattern recognition. Pattern recognition — the act of taking in raw 
data and taking an action based on the “category” of the pattern — has been crucial 
for our survival, and over the past tens of millions of years we have evolved highly 
sophisticated neural and cognitive systems for such tasks. 


1.1 Machine Perception 


It is natural that we should seek to design and build machines that can recognize 
patterns. From automated speech recognition, fingerprint identification, optical char- 
acter recognition, DNA sequence identification and much more, it is clear that reli- 
able, accurate pattern recognition by machine would be immensely useful. Moreover, 
in solving the myriad problems required to build such systems, we gain deeper un- 
derstanding and appreciation for pattern recognition systems in the natural world — 
most particularly in humans. For some applications, such as speech and visual recog- 
nition, our design efforts may in fact be influenced by knowledge of how these are 
solved in nature, both in the algorithms we employ and the design of special purpose 
hardware. 


1.2 An Example 


To illustrate the complexity of some of the types of problems involved, let us consider 
the following imaginary and somewhat fanciful example. Suppose that a fish packing 
plant wants to automate the process of sorting incoming fish on a conveyor belt 
according to species. As a pilot project it is decided to try to separate sea bass from 
salmon using optical sensing. We set up a camera, take some sample images and begin 
to note some physical differences between the two types of fish — length, lightness, 
width, number and shape of fins, position of the mouth, and so on — and these suggest 
features to explore for use in our classifier. We also notice noise or variations in the
images: variations in lighting, position of the fish on the conveyor, even “static”
due to the electronics of the camera itself.

Given that there truly are differences between the population of sea bass and that
of salmon, we view them as having different models, i.e., different descriptions, which
are typically mathematical in form. The overarching goal and approach in pattern 
classification is to hypothesize the class of these models, process the sensed data 
to eliminate noise (not due to the models), and for any sensed pattern choose the 
model that corresponds best. Any techniques that further this aim should be in the 
conceptual toolbox of the designer of pattern recognition systems. 

Our prototype system to perform this very specific task might well have the form 
shown in Fig. 1.1. First the camera captures an image of the fish. Next, the camera’s
signals are preprocessed to simplify subsequent operations without losing relevant
information. In particular, we might use a segmentation operation in which the images
of different fish are somehow isolated from one another and from the background. The
information from a single fish is then sent to a feature extractor, whose purpose is to
reduce the data by measuring certain “features” or “properties.” These features
(or, more precisely, the values of these features) are then passed to a classifier that
evaluates the evidence presented and makes a final decision as to the species.

The preprocessor might automatically adjust for average light level, or threshold 
the image to remove the background of the conveyor belt, and so forth. For the 
moment let us pass over how the images of the fish might be segmented and consider 
how the feature extractor and classifier might be designed. Suppose somebody at the 
fish plant tells us that a sea bass is generally longer than a salmon. These, then, 
give us our tentative models for the fish: sea bass have some typical length, and this 
is greater than that for salmon. Then length becomes an obvious feature, and we 
might attempt to classify the fish merely by seeing whether or not the length l of 
a fish exceeds some critical value l*. To choose l* we could obtain some design or
training samples of the different types of fish, (somehow) make length measurements,
and inspect the results.

Suppose that we do this, and obtain the histograms shown in Fig. 1.2. These 
disappointing histograms bear out the statement that sea bass are somewhat longer 
than salmon, on average, but it is clear that this single criterion is quite poor; no 
matter how we choose l*, we cannot reliably separate sea bass from salmon by length 
alone. 
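
The search for l* described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lengths below are made-up numbers (not the data behind Fig. 1.2), and the helper name `best_threshold` is hypothetical. The rule tested is "call the fish a sea bass if its length exceeds the threshold, otherwise a salmon."

```python
def best_threshold(values, labels):
    """Return (threshold, errors) minimizing training errors for the rule
    'sea bass if value > threshold, otherwise salmon'."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    # Candidate thresholds: below every sample, midway between
    # neighbouring distinct samples, and above every sample.
    candidates = [xs[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(xs, xs[1:]) if a != b]
    candidates.append(xs[-1] + 1.0)

    best = None
    for t in candidates:
        # A sample is misclassified when the rule disagrees with its label.
        errors = sum(1 for v, y in pairs if (v > t) != (y == "sea bass"))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Hypothetical training lengths with overlapping classes.
lengths = [8.0, 9.5, 10.0, 11.0, 12.5, 10.5, 12.0, 13.0, 14.5, 15.0]
species = ["salmon"] * 5 + ["sea bass"] * 5

l_star, n_errors = best_threshold(lengths, species)
print(l_star, n_errors)
```

On these overlapping samples even the best threshold leaves some training errors, which is the situation the histograms illustrate.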

Discouraged, but undeterred by these unpromising results, we try another feature 
— the average lightness of the fish scales. Now we are very careful to eliminate 
variations in illumination, since they can only obscure the models and corrupt our 
new classifier. The resulting histograms, shown in Fig. 1.3, are much more satisfactory 
— the classes are much better separated. 

So far we have tacitly assumed that the consequences of our actions are equally 
costly: deciding the fish was a sea bass when in fact it was a salmon was just as
undesirable as the converse. Such a symmetry in the cost is often, but not invariably,
the case. For instance, as a fish packing company we may know that our customers 
easily accept occasional pieces of tasty salmon in their cans labeled “sea bass,” but 
they object vigorously if a piece of sea bass appears in their cans labeled “salmon.” 
If we want to stay in business, we should adjust our decision boundary to avoid 
antagonizing our customers, even if it means that more salmon makes its way into 
the cans of sea bass. In this case, then, we should move our decision boundary x* to 
smaller values of lightness, thereby reducing the number of sea bass that are classified 
as salmon (Fig. 1.3). The more our customers object to getting sea bass with their
salmon (i.e., the more costly this type of error), the lower we should set the decision
threshold x* in Fig. 1.3.

Figure 1.1: The objects to be classified are first sensed by a transducer (camera),
whose signals are preprocessed, then the features extracted and finally the classification
emitted (here either “salmon” or “sea bass”). Although the information flow
is often chosen to be from the source to the classifier (“bottom-up”), some systems
employ “top-down” flow as well, in which earlier levels of processing can be altered
based on the tentative or preliminary response in later levels (gray arrows). Yet others
combine two or more stages into a unified step, such as simultaneous segmentation
and feature extraction.

Such considerations suggest that there is an overall single cost associated with our 
decision, and our true task is to make a decision rule (i.e., set a decision boundary) 
so as to minimize such a cost. This is the central task of decision theory of which 
pattern classification is perhaps the most important subfield. 
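
To make the effect of unequal costs concrete, here is a minimal sketch that picks the lightness threshold with the lowest total cost on a training set. The lightness values, the cost figures and the function names are all hypothetical, chosen only for illustration. The rule is "salmon if lightness is below the threshold, sea bass otherwise," matching the direction of the boundary in the text.

```python
def total_cost(threshold, lightness, labels,
               cost_bass_as_salmon, cost_salmon_as_bass):
    """Sum the cost of every misclassified training sample."""
    cost = 0.0
    for x, y in zip(lightness, labels):
        predicted = "salmon" if x < threshold else "sea bass"
        if predicted == "salmon" and y == "sea bass":
            cost += cost_bass_as_salmon
        elif predicted == "sea bass" and y == "salmon":
            cost += cost_salmon_as_bass
    return cost


def min_cost_threshold(lightness, labels, cost_bass_as_salmon,
                       cost_salmon_as_bass, candidates):
    """Return the candidate threshold with the smallest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(t, lightness, labels,
                                 cost_bass_as_salmon, cost_salmon_as_bass),
    )


# Hypothetical lightness measurements.
lightness = [1.0, 2.0, 2.5, 3.0, 3.5, 2.75, 4.0, 5.0, 6.0, 7.0]
labels = ["salmon"] * 5 + ["sea bass"] * 5
grid = [0.5 * k for k in range(17)]  # candidate thresholds 0.0, 0.5, ..., 8.0

x_equal = min_cost_threshold(lightness, labels, 1.0, 1.0, grid)
x_skewed = min_cost_threshold(lightness, labels, 10.0, 1.0, grid)
print(x_equal, x_skewed)
```

With equal costs the threshold sits where the fewest samples are misclassified. Making "sea bass labeled as salmon" ten times as costly pushes the chosen threshold to a smaller lightness value, as the argument above predicts.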

Even if we know the costs associated with our decisions and choose the optimal
decision boundary x*, we may be dissatisfied with the resulting performance. Our
first impulse might be to seek yet a different feature on which to separate the fish.
Let us assume, though, that no other single visual feature yields better performance
than that based on lightness. To improve recognition, then, we must resort to the use
of more than one feature at a time.

Figure 1.2: Histograms for the length feature for the two categories (horizontal axis:
length; vertical axis: count). No single threshold value l* (decision boundary) will
serve to unambiguously discriminate between the two categories; using length alone,
we will have some errors. The value l* marked will lead to the smallest number of
errors, on average.

Figure 1.3: Histograms for the lightness feature for the two categories (horizontal
axis: lightness; vertical axis: count). No single threshold value x* (decision boundary)
will serve to unambiguously discriminate between the two categories; using lightness
alone, we will have some errors. The value x* marked will lead to the smallest number
of errors, on average.

Figure 1.4: The two features of lightness and width for sea bass and salmon (horizontal
axis: lightness; vertical axis: width). The dark line might serve as a decision boundary
of our classifier. Overall classification error on the data shown is lower than if we use
only one feature as in Fig. 1.3, but there will still be some errors.

In our search for other features, we might try to capitalize on the observation that 
sea bass are typically wider than salmon. Now we have two features for classifying 
fish: the lightness x_1 and the width x_2. If we ignore how these features might be
measured in practice, we realize that the feature extractor has thus reduced the image
of each fish to a point or feature vector x in a two-dimensional feature space, where

    x = (x_1, x_2)^t.

Our problem now is to partition the feature space into two regions, where for all
patterns in one region we will call the fish a sea bass, and for all patterns in the other
we will call it a salmon. Suppose that we measure the feature vectors for our samples and
obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule 
for separating the fish: Classify the fish as sea bass if its feature vector falls above the 
decision boundary shown, and as salmon otherwise. 
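
A straight-line decision boundary in the two-dimensional feature space can be written as g(x) = w_1 x_1 + w_2 x_2 + b = 0, with the fish called a sea bass when g(x) > 0. The sketch below applies such a rule. The weights and offset are hypothetical values picked for illustration and are not read off Fig. 1.4.

```python
def classify(x, w=(0.8, 1.0), b=-22.0):
    """Classify a feature vector x = (lightness, width) with a linear rule.

    The boundary is the line w[0]*x[0] + w[1]*x[1] + b = 0. Because w[1] is
    positive, points above the line (larger width) give g > 0.
    The default w and b are illustrative assumptions only.
    """
    g = w[0] * x[0] + w[1] * x[1] + b
    return "sea bass" if g > 0 else "salmon"


print(classify((3.0, 16.0)))  # a point below the line
print(classify((8.0, 20.0)))  # a point above the line
```

Choosing w and b from training samples is a separate problem. Here they are simply fixed by hand to show how the feature space is split into two regions.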

This rule appears to do a good job of separating our samples and suggests that 
perhaps incorporating yet more features would be desirable. Besides the lightness 
and width of the fish, we might include some shape parameter, such as the vertex 
angle of the dorsal fin, or the placement of the eyes (expressed as a proportion of
the mouth-to-tail distance), and so on. How do we know beforehand which of these 
features will work best? Some features might be redundant: for instance if the eye 
color of all fish correlated perfectly with width, then classification performance need 
not be improved if we also include eye color as a feature. Even if the difficulty or 
computational cost in attaining more features is of no concern, might we ever have 
too many features? 

Suppose that other features are too expensive to measure, or provide
little improvement (or possibly even degrade the performance) in the approach described
above, and that we are forced to make our decision based on the two features
in Fig. 1.4. If our models were extremely complicated, our classifier would have a
decision boundary more complex than the simple straight line. In that case all the
training patterns would be separated perfectly, as shown in Fig. 1.5. With such a
“solution,” though, our satisfaction would be premature because the central aim of
designing a classifier is to suggest actions when presented with novel patterns, i.e.,
fish not yet seen. This is the issue of generalization. It is unlikely that the complex
decision boundary in Fig. 1.5 would provide good generalization, since it seems to be
“tuned” to the particular training samples, rather than some underlying characteristics
or true model of all the sea bass and salmon that will have to be separated.

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are
complicated. While such a decision may lead to perfect classification of our training
samples, it would lead to poor performance on future patterns. The novel test point
marked ? is evidently most likely a salmon, whereas the complex decision boundary
shown leads it to be misclassified as a sea bass.

Naturally, one approach would be to get more training samples for obtaining a 
better estimate of the true underlying characteristics, for instance the probability 
distributions of the categories. In most pattern recognition problems, however, the 
amount of such data we can obtain easily is often quite limited. Even with a vast 
amount of training data in a continuous feature space though, if we followed the 
approach in Fig. 1.5 our classifier would give a horrendously complicated decision 
boundary — one that would be unlikely to do well on novel patterns. 

Rather, then, we might seek to “simplify” the recognizer, motivated by a belief 
that the underlying models will not require a decision boundary that is as complex as 
that in Fig. 1.5. Indeed, we might be satisfied with the slightly poorer performance 
on the training samples if it means that our classifier will have better performance 
on novel patterns.* But if designing a very complex recognizer is unlikely to give 
good generalization, precisely how should we quantify and favor simpler classifiers? 
How would our system automatically determine that the simple curve in Fig. 1.6 
is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated 
boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, 
can we then predict how well our system will generalize to new patterns? These are 
some of the central problems in statistical pattern recognition. 

* The philosophical underpinnings of this approach derive from William of Occam (1284-1347?), who
advocated favoring simpler explanations over those that are needlessly complicated: Entia non
sunt multiplicanda praeter necessitatem (“Entities are not to be multiplied without necessity”).
Decisions based on overly complex models often lead to lower accuracy of the classifier.

Figure 1.6: The decision boundary shown might represent the optimal tradeoff between
performance on the training set and simplicity of classifier.

For the same incoming patterns, we might need to use a drastically different cost
function, and this will lead to different actions altogether. We might, for instance,
wish instead to separate the fish based on their sex — all females (of either species) 
from all males if we wish to sell roe. Alternatively, we might wish to cull the damaged 
fish (to prepare separately for cat food), and so on. Different decision tasks may 
require features and yield boundaries quite different from those useful for our original 
categorization problem. 

This makes it quite clear that our decisions are fundamentally task or cost specific, 
and that creating a single general purpose artificial pattern recognition device — i.e., 
one capable of acting accurately based on a wide variety of tasks — is a profoundly 
difficult challenge. This, too, should give us added appreciation of the ability of 
humans to switch rapidly and fluidly between pattern recognition tasks. 

Since classification is, at base, the task of recovering the model that generated the 
patterns, different classification techniques are useful depending on the type of candi- 
date models themselves. In statistical pattern recognition we focus on the statistical 
properties of the patterns (generally expressed in probability densities), and this will 
command most of our attention in this book. Here the model for a pattern may be a 
single specific set of features, though the actual pattern sensed has been corrupted by 
some form of random noise. Occasionally it is claimed that neural pattern recognition 
(or neural network pattern classification) should be considered its own discipline, but 
despite its somewhat different intellectual pedigree, we will consider it a close descen- 
dant of statistical pattern recognition, for reasons that will become clear. If instead 
the model consists of some set of crisp logical rules, then we employ the methods of 
syntactic pattern recognition, where rules or grammars describe our decision. For ex- 
ample we might wish to classify an English sentence as grammatical or not, and here 
statistical descriptions (word frequencies, word correlations, etc.) are inappropriate.

It was necessary in our fish example to choose our features carefully, and hence 
achieve a representation (as in Fig. 1.6) that enabled reasonably successful pattern 
classification. A central aspect in virtually every pattern recognition problem is that 
of achieving such a “good” representation, one in which the structural relationships
among the components are simply and naturally revealed, and one in which the true
(unknown) model of the patterns can be expressed. In some cases patterns should be 
represented as vectors of real-valued numbers, in others ordered lists of attributes, in 
yet others descriptions of parts and their relations, and so forth. We seek a representation
in which the patterns that lead to the same action are somehow “close” to one
another, yet “far” from those that demand a different action. The extent to which we 
create or learn a proper representation and how we quantify near and far apart will 
determine the success of our pattern classifier. A number of additional characteris- 
tics are desirable for the representation. We might wish to favor a small number of 
features, which might lead to simpler decision regions, and a classifier easier to train. 
We might also wish to have features that are robust, i.e., relatively insensitive to noise 
or other errors. In practical applications we may need the classifier to act quickly, or 
use few electronic components, memory or processing steps. 


A central technique, when we have insufficient training data, is to incorporate 
knowledge of the problem domain. Indeed the less the training data the more impor- 
tant is such knowledge, for instance how the patterns themselves were produced. One 
method that takes this notion to its logical extreme is that of analysis by synthesis, 
where in the ideal case one has a model of how each pattern is generated. Con- 
sider speech recognition. Amidst the manifest acoustic variability among the possible 
“dee”s that might be uttered by different people, one thing they have in common is 
that they were all produced by lowering the jaw slightly, opening the mouth, placing 
the tongue tip against the roof of the mouth after a certain delay, and so on. We 
might assume that “all” the acoustic variation is due to the happenstance of whether 
the talker is male or female, old or young, with different overall pitches, and so forth. 
At some deep level, such a “physiological” model (or so-called “motor” model) for 
production of the utterances is appropriate, and different (say) from that for “doo” 
and indeed all other utterances. If this underlying model of production can be determined
from the sound (and that is a very big if), then we can classify the utterance by
how it was produced. That is to say, the production representation may be the “best” 
representation for classification. Our pattern recognition systems should then analyze 
(and hence classify) the input pattern based on how one would have to synthesize 
that pattern. The trick is, of course, to recover the generating parameters from the 
sensed pattern. 


Consider the difficulty in making a recognizer of all types of chairs — standard 
office chair, contemporary living room chair, beanbag chair, and so forth — based on 
an image. Given the astounding variety in the number of legs, material, shape, and 
so on, we might despair of ever finding a representation that reveals the unity within 
the class of chair. Perhaps the only such unifying aspect of chairs is functional: a 
chair is a stable artifact that supports a human sitter, including back support. Thus 
we might try to deduce such functional properties from the image, and the property 
“can support a human sitter” is very indirectly related to the orientation of the larger 
surfaces, and would need to be answered in the affirmative even for a beanbag chair. 
Of course, this requires some reasoning about the properties and naturally touches 
upon computer vision rather than pattern recognition proper. 


Without going to such extremes, many real world pattern recognition systems seek 
to incorporate at least some knowledge about the method of production of the patterns
or their functional use in order to ensure a good representation, though of course
the goal of the representation is classification, not reproduction. For instance, in op- 
tical character recognition (OCR) one might confidently assume that handwritten 
characters are written as a sequence of strokes, and first try to recover a stroke rep- 
resentation from the sensed image, and then deduce the character from the identified 
strokes. 




1.2.1 Related Fields


Pattern classification differs from classical statistical hypothesis testing, wherein the 
sensed data are used to decide whether or not to reject a null hypothesis in favor of 
some alternative hypothesis. Roughly speaking, if the probability of obtaining the 
data given some null hypothesis falls below a “significance” threshold, we reject the 
null hypothesis in favor of the alternative. For typical values of this criterion, there is 
a strong bias or predilection in favor of the null hypothesis; even though the alternative
hypothesis may be more probable, we might not be able to reject the null hypothesis.
Hypothesis testing is often used to determine whether a drug is effective, where the
null hypothesis is that it has no effect. Hypothesis testing might be used to determine
whether the fish on the conveyor belt belong to a single class (the null hypothesis) or
to two classes (the alternative). In contrast, given some data, pattern classification
seeks to find the most probable hypothesis from a set of hypotheses — “this fish is 
probably a salmon.” 

Pattern classification differs, too, from image processing. In image processing, the 
input is an image and the output is an image. Image processing steps often include 
rotation, contrast enhancement, and other transformations which preserve all the 
original information. Feature extraction, such as finding the peaks and valleys of the
intensity, loses information (but hopefully preserves everything relevant to the task at
hand).

As just described, feature extraction takes in a pattern and produces feature values. 
The number of features is virtually always chosen to be fewer than the total necessary 
to describe the complete target of interest, and this leads to a loss in information. In 
acts of associative memory, the system takes in a pattern and emits another pattern 
which is representative of a general group of patterns. It thus reduces the information 
somewhat, but rarely to the extent that pattern classification does. In short, because
of the crucial role of a decision in pattern recognition, it is fundamentally
an information reduction process. The classification step represents an even more
radical loss of information, reducing the original several thousand bits representing
the color of each of several thousand pixels down to just a few bits representing
the chosen category (a single bit in our fish example).


1.3 The Sub-problems of Pattern Classification 


We have alluded to some of the issues in pattern classification and we now turn to a 
more explicit list of them. In practice, these typically require the bulk of the research 
and development effort. Many are domain or problem specific, and their solution will 
depend upon the knowledge and insights of the designer. Nevertheless, a few are of 
sufficient generality, difficulty, and interest that they warrant explicit consideration. 


1.3.1 Feature Extraction 


The conceptual boundary between feature extraction and classification proper is some- 
what arbitrary: an ideal feature extractor would yield a representation that makes 
the job of the classifier trivial; conversely, an omnipotent classifier would not need the 
help of a sophisticated feature extractor. The distinction is forced upon us for practi- 
cal, rather than theoretical reasons. Generally speaking, the task of feature extraction 
is much more problem and domain dependent than is classification proper, and thus 
requires knowledge of the domain. A good feature extractor for sorting fish would
surely be of little use for identifying fingerprints, or classifying photomicrographs of
blood cells. How do we know which features are most promising? Are there ways to 
automatically learn which features are best for the classifier? How many shall we use? 


1.3.2 Noise 


The lighting of the fish may vary, there could be shadows cast by neighboring equip- 
ment, the conveyor belt might shake — all reducing the reliability of the feature values 
actually measured. We define noise in very general terms: any property of the sensed
pattern due not to the true underlying model but instead to randomness in the world 
or the sensors. All non-trivial decision and pattern recognition problems involve noise 
in some form. In some cases it is due to the transduction in the signal and we may 
consign to our preprocessor the role of cleaning up the signal, as for instance visual 
noise in our video camera viewing the fish. An important problem is knowing somehow
whether the variation in some signal is due to noise or instead to complex underlying
models of the fish. How then can we use this information to improve our classifier?


1.3.3 Overfitting 


In going from Fig. 1.4 to Fig. 1.5 in our fish classification problem, we were, implicitly,
using a more complex model of sea bass and of salmon. That is, we were adjusting 
the complexity of our classifier. While an overly complex model may allow perfect 
classification of the training samples, it is unlikely to give good classification of novel 
patterns — a situation known as overfitting. One of the most important areas of re- 
search in statistical pattern classification is determining how to adjust the complexity 
of the model — not so simple that it cannot explain the differences between the cat- 
egories, yet not so complex as to give poor classification on novel patterns. Are there 
principled methods for finding the best (intermediate) complexity for a classifier? 
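
The contrast between fitting the training samples and classifying novel patterns can be seen in a toy experiment. The sketch below compares a rule that memorizes the training set (a one-nearest-neighbor rule, used here only as a stand-in for an "overly complex" model) with a single fixed threshold. All numbers are made up, and the threshold value 4.5 is an assumption.

```python
def nearest_neighbor_label(x, training):
    """Label x with the label of the closest training sample (memorization)."""
    return min(training, key=lambda pair: abs(pair[0] - x))[1]


def threshold_label(x, threshold=4.5):
    """Simple rule: salmon below the threshold, sea bass otherwise."""
    return "salmon" if x < threshold else "sea bass"


def count_errors(rule, samples):
    """Number of samples whose predicted label differs from the true one."""
    return sum(1 for x, y in samples if rule(x) != y)


# Hypothetical lightness data; 6.5 (salmon) and 3.5 (sea bass) are atypical.
train = [(1.0, "salmon"), (2.0, "salmon"), (3.0, "salmon"), (4.0, "salmon"),
         (6.5, "salmon"), (3.5, "sea bass"), (5.0, "sea bass"),
         (6.0, "sea bass"), (7.0, "sea bass"), (8.0, "sea bass")]
test = [(1.5, "salmon"), (2.5, "salmon"), (3.4, "salmon"), (3.6, "salmon"),
        (6.4, "sea bass"), (6.6, "sea bass"), (5.5, "sea bass"),
        (7.5, "sea bass")]


def complex_rule(x):
    return nearest_neighbor_label(x, train)


print("train errors:", count_errors(complex_rule, train),
      count_errors(threshold_label, train))
print("test errors: ", count_errors(complex_rule, test),
      count_errors(threshold_label, test))
```

The memorizing rule is perfect on the training samples but follows the two atypical fish, so it errs on nearby novel patterns. The simpler threshold accepts a couple of training errors and does better on the new samples.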


1.3.4 Model Selection 


We might have been unsatisfied with the performance of our fish classifier in Figs. 1.4 
& 1.5, and thus jumped to an entirely different class of model, for instance one based 
on some function of the number and position of the fins, the color of the eyes, the 
weight, shape of the mouth, and so on. How do we know when a hypothesized model 
differs significantly from the true model underlying our patterns, and thus a new 
model is needed? In short, how are we to know to reject a class of models and try 
another one? Are we as designers reduced to random and tedious trial and error in 
model selection, never really knowing whether we can expect improved performance? 
Or might there be principled methods for knowing when to jettison one class of models 
and invoke another? Can we automate the process? 


1.3.5 Prior Knowledge 


In one limited sense, we have already seen how prior knowledge about the lightness
of the different fish categories helped in the design of a classifier by suggesting a
promising feature. Incorporating prior knowledge can be far more subtle and difficult. 
In some applications the knowledge ultimately derives from information about the 
production of the patterns, as we saw in analysis-by-synthesis. In others the knowledge 
may be about the form of the underlying categories, or specific attributes of the 
patterns, such as the fact that a face has two eyes, one nose, and so on.


1.3.6 Missing Features


Suppose that during classification, the value of one of the features cannot be deter- 
mined, for example the width of the fish because of occlusion by another fish (i.e., 
the other fish is in the way). How should the categorizer compensate? Since our 
two-feature recognizer never had a single-variable threshold value x* determined in 
anticipation of the possible absence of a feature (cf. Fig. 1.3), how shall it make the
best decision using only the feature present? The naive method, of merely assuming 
that the value of the missing feature is zero or the average of the values for the train- 
ing patterns, is provably non-optimal. Likewise we occasionally have missing features 
during the creation or learning in our recognizer. How should we train a classifier or 
use one when some features are missing? 
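
One standard alternative to guessing the missing value, offered here as an assumption rather than as the method this section has in mind, is to sum (marginalize) over every value the missing feature could have taken. The sketch below does this for a small discrete joint probability table whose entries are invented for illustration.

```python
# Hypothetical joint probabilities P(category, lightness, width).
joint = {
    ("salmon", "dark", "narrow"): 0.30,
    ("salmon", "dark", "wide"): 0.05,
    ("salmon", "light", "narrow"): 0.10,
    ("salmon", "light", "wide"): 0.05,
    ("sea bass", "dark", "narrow"): 0.05,
    ("sea bass", "dark", "wide"): 0.10,
    ("sea bass", "light", "narrow"): 0.05,
    ("sea bass", "light", "wide"): 0.30,
}


def posterior_from_lightness(joint, lightness):
    """P(category | lightness) when the width is missing.

    The unknown width is summed out instead of being replaced by a guess.
    """
    scores = {}
    for (category, x1, _x2), p in joint.items():
        if x1 == lightness:
            scores[category] = scores.get(category, 0.0) + p
    total = sum(scores.values())
    return {category: s / total for category, s in scores.items()}


post = posterior_from_lightness(joint, "light")
decision = max(post, key=post.get)
print(post, decision)
```

The decision then rests only on the feature that was actually observed, weighted by how the missing one is distributed in each category.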


1.3.7 Mereology 


We effortlessly read a simple word such as BEATS. But consider this: Why didn't 
we read instead other words that are perfectly good subsets of the full pattern, such 
as BE, BEAT, EAT, AT, and EATS? Why don't they enter our minds, unless 
explicitly brought to our attention? Or when we saw the B why didn't we read a P 
or an I, which are “there” within the B? Conversely, how is it that we can read the 
two unsegmented words in POLOPONY — without placing the entire input into a 
single word category? 

This is the problem of subsets and supersets — formally part of mereology, the 
study of part/whole relationships. It is closely related to that of prior knowledge and
segmentation. In short, how do we recognize or group together the “proper” number 
of elements — neither too few nor too many? It appears as though the best classifiers 
try to incorporate as much of the input into the categorization as “makes sense,” but 
not too much. How can this be done? 


1.3.8 Segmentation 


In our fish example, we have tacitly assumed that the fish were isolated, separate 
on the conveyor belt. In practice, they would often be abutting or overlapping, and 
our system would have to determine where one fish ends and the next begins — the 
individual patterns have to be segmented. If we have already recognized the fish then 
it would be easier to segment them. But how can we segment the images before they 
have been categorized or categorize them before they have been segmented? It seems 
we need a way to know when we have switched from one model to another, or to know 
when we just have background or “no category.” How can this be done?

Segmentation is one of the deepest problems in automated speech recognition.
We might seek to recognize the individual sounds (e.g., phonemes, such as “ss,” “k,” 
...), and then put them together to determine the word. But consider two nonsense 
words, “sklee” and “skloo.” Speak them aloud and notice that for “skloo” you push 
your lips forward (so-called “rounding” in anticipation of the upcoming “oo”) before 
you utter the “ss.” Such rounding influences the sound of the “ss,” lowering the 
frequency spectrum compared to the “ss” sound in “sklee” — a phenomenon known 
as anticipatory coarticulation. Thus, the “oo” phoneme reveals its presence in the “ss” 
earlier than the “k” and “l” which nominally occur before the “oo” itself! How do we
segment the “oo” phoneme from the others when they are so manifestly intermingled? 
Or should we even try? Perhaps we are focusing on groupings of the wrong size, and
the most useful unit for recognition is somewhat larger, as we saw in subsets and
supersets (Sect. 1.3.7). A related problem occurs in connected cursive handwritten
character recognition: How do we know where one character “ends” and the next one
“begins”?


1.3.9 Context 


We might be able to use context — input-dependent information other than from the 
target pattern itself — to improve our recognizer. For instance, it might be known 
for our fish packing plant that if we are getting a sequence of salmon, it is highly 
likely that the next fish will be a salmon (since it probably comes from a boat that just 
returned from a fishing area rich in salmon). Thus, if after a long series of salmon our 
recognizer detects an ambiguous pattern (i.e., one very close to the nominal decision 
boundary), it may nevertheless be best to categorize it too as a salmon. We shall see 
how such a simple correlation among patterns — the most elementary form of context 
— might be used to improve recognition. But how, precisely, should we incorporate 
such information? 

Context can be highly complex and abstract. The utterance “jeetyet?” may seem 
nonsensical, unless you hear it spoken by a friend in the context of the cafeteria at 
lunchtime — “did you eat yet?” How can such a visual and temporal context influence 
your speech recognition? 


1.3.10 Invariances 


In seeking to achieve an optimal representation for a particular pattern classification 
task, we confront the problem of invariances. In our fish example, the absolute 
position on the conveyor belt is irrelevant to the category and thus our representation 
should also be insensitive to absolute position of the fish. Here we seek a representation 
that is invariant to the transformation of translation (in either horizontal or vertical 
directions). Likewise, in a speech recognition problem, it might be required only that 
we be able to distinguish between utterances regardless of the particular moment they 
were uttered; here the “translation” invariance we must ensure is in time. 

The “model parameters” describing the orientation of our fish on the conveyor 
belt are horrendously complicated — due as they are to the sloshing of water, the 
bumping of neighboring fish, the shape of the fish net, etc. — and thus we give up hope 
of ever trying to use them. These parameters are irrelevant to the model parameters 
that interest us anyway, i.e., the ones associated with the differences between the fish 
categories. Thus here we try to build a classifier that is invariant to transformations 
such as rotation. 

The orientation of the fish on the conveyor belt is irrelevant to its category. Here 
the transformation of concern is a two-dimensional rotation about the camera’s line 
of sight. A more general invariance would be for rotations about an arbitrary line in 
three dimensions. The image of even such a “simple” object as a coffee cup undergoes 
radical variation as the cup is rotated to an arbitrary angle — the handle may become 
hidden, the bottom of the inside volume come into view, the circular lip appear oval or 
a straight line or even obscured, and so forth. How might we ensure that our pattern 
recognizer is invariant to such complex changes? 

The overall size of an image may be irrelevant for categorization. Such differences 
might be due to variation in the range to the object; alternatively we may be genuinely 
unconcerned with differences between sizes — a young, small salmon is still a salmon. 

For patterns that have inherent temporal variation, we may want our recognizer 
to be insensitive to the rate at which the pattern evolves. Thus a slow hand wave and 
a fast hand wave may be considered as equivalent. Rate variation is a deep problem 
in speech recognition, of course; not only do different individuals talk at different 
rates, but even a single talker may vary in rate, causing the speech signal to change 
in complex ways. Likewise, cursive handwriting varies in complex ways as the writer 
speeds up: the placement of dots on the i’s, and cross bars on the t’s and f’s, are 
the first casualties of rate increase, while the appearance of l’s and e’s is relatively 
inviolate. How can we make a recognizer that changes its representations for some 
categories differently from that for others under such rate variation? 

A large number of highly complex transformations arise in pattern recognition, 
and many are domain specific. We might wish to make our handwritten optical 
character recognizer insensitive to the overall thickness of the pen line, for instance. 
Far more severe are transformations such as non-rigid deformations that arise in three- 
dimensional object recognition, such as the radical variation in the image of your hand 
as you grasp an object or snap your fingers. Similarly, variations in illumination or 
the complex effects of cast shadows may need to be taken into account. 

The symmetries just described are continuous — the pattern can be translated, 
rotated, sped up, or deformed by an arbitrary amount. In some pattern recognition 
applications other — discrete — symmetries are relevant, such as flips left-to-right, 
or top-to-bottom. 

In all of these invariances the problem arises: How do we determine whether an 
invariance is present? How do we efficiently incorporate such knowledge into our 
recognizer? 


1.3.11 Evidence Pooling 


In our fish example we saw how using multiple features could lead to improved recog- 
nition. We might imagine that we could do better if we had several component 
classifiers. If these categorizers agree on a particular pattern, there is no difficulty. 
But suppose they disagree. How should a “super” classifier pool the evidence from the 
component recognizers to achieve the best decision? 

Imagine calling in ten experts for determining if a particular fish is diseased or 
not. While nine agree that the fish is healthy, one expert does not. Who is right? 
It may be that the lone dissenter is the only one familiar with the particular very 
rare symptoms in the fish, and is in fact correct. How would the “super” categorizer 
know when to base a decision on a minority opinion, even from an expert in one small 
domain who is not well qualified to judge throughout a broad range of problems? 


1.3.12 Costs and Risks 


We should realize that a classifier rarely exists in a vacuum. Instead, it is generally 
to be used to recommend actions (put this fish in this bucket, put that fish in that 
bucket), each action having an associated cost or risk. Conceptually, the simplest 
such risk is the classification error: what percentage of new patterns are called the 
wrong category. However the notion of risk is far more general, as we shall see. We 
often design our classifier to recommend actions that minimize some total expected 
cost or risk. Thus, in some sense, the notion of category itself derives from the cost 
or task. How do we incorporate knowledge about such risks and how will they affect 
our classification decision? 

Finally, can we estimate the total risk and thus tell whether our classifier is 
acceptable even before we field it? Can we estimate the lowest possible risk of any 
classifier, to see how closely ours approaches this ideal, or whether the problem is 
simply too hard overall? 


1.3.13 Computational Complexity 


Some pattern recognition problems can be solved using algorithms that are highly 
impractical. For instance, we might try to hand label all possible 20 x 20 binary pixel 
images with a category label for optical character recognition, and use table lookup 
to classify incoming patterns. Although we might achieve error-free recognition, the 
labeling time and storage requirements would be quite prohibitive since it would 
require labeling each of $2^{20 \times 20} \approx 10^{120}$ patterns. Thus the computational complexity 
of different algorithms is of importance, especially for practical applications. 
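
The size of the lookup table can be checked with a few lines of arithmetic. The sketch below is only an illustration; the variable names are our own and nothing here is specific to any particular recognizer.

```python
# Count the distinct 20 x 20 binary images a lookup-table classifier would need.
side = 20
num_pixels = side * side          # 400 binary pixels per image
num_patterns = 2 ** num_pixels    # each pixel is independently 0 or 1

# Python integers have arbitrary precision, so the count is exact.
num_digits = len(str(num_patterns))
print(f"2^{num_pixels} has {num_digits} decimal digits")
```

A number with this many digits cannot be enumerated, let alone labeled by hand, which is the point of the example.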

In more general terms, we may ask how an algorithm scales as a function of the 
number of feature dimensions, or the number of patterns or the number of categories. 
What is the tradeoff between computational ease and performance? In some prob- 
lems we know we can design an excellent recognizer, but not within the engineering 
constraints. How can we optimize within such constraints? We are typically less 
concerned with the complexity of learning, which is done in the laboratory, than the 
complexity of making a decision, which is done with the fielded application. While 
computational complexity generally correlates with the complexity of the hypothe- 
sized model of the patterns, these two notions are conceptually different. 


This section has catalogued some of the central problems in classification. It has 
been found that the most effective methods for developing classifiers involve learning 
from examples, i.e., from a set of patterns whose category is known. Throughout this 
book, we shall see again and again how methods of learning relate to these central 
problems, and are essential in the building of classifiers. 


1.4 Learning and Adaptation 


In the broadest sense, any method that incorporates information from training samples 
in the design of a classifier employs learning. Because nearly all practical or 
interesting pattern recognition problems are so hard that we cannot guess the 
classification decision ahead of time, we shall spend the great majority of our time here 
considering learning. Creating classifiers then involves positing some general form of 
model, or form of the classifier, and using training patterns to learn or estimate the 
unknown parameters of the model. Learning refers to some form of algorithm for 
reducing the error on a set of training data. A range of gradient descent algorithms 
that alter a classifier’s parameters in order to reduce an error measure now permeate 
the field of statistical pattern recognition, and these will demand a great deal of our 
attention. Learning comes in several general forms. 
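
To make the idea of reducing a training error by gradient descent concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the one-dimensional logistic model, the cross-entropy error, the learning rate, and the lightness readings (taken relative to an arbitrary reference level) are hypothetical and are not drawn from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean cross-entropy error of a 1-D logistic model."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of the error w.r.t. (w*x + b)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n               # step against the gradient
        b -= lr * grad_b / n
    return w, b

def training_error(xs, ys, w, b):
    """Fraction of training patterns assigned to the wrong category."""
    wrong = sum((sigmoid(w * x + b) >= 0.5) != (y == 1) for x, y in zip(xs, ys))
    return wrong / len(xs)

# Hypothetical lightness readings; label 0 = salmon, label 1 = sea bass.
xs = [-2.0, -1.5, -1.0, -0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

error_before = training_error(xs, ys, 0.0, 0.0)
w, b = train(xs, ys)
error_after = training_error(xs, ys, w, b)
print(error_before, error_after)
```

The parameters start at zero, where every pattern receives the same label, and each pass moves them a small step in the direction that lowers the error on the training set.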


1.4.1 Supervised Learning 


In supervised learning, a teacher provides a category label or cost for each pattern 
in a training set, and we seek to reduce the sum of the costs for these patterns. 
How can we be sure that a particular learning algorithm is powerful enough to learn 
the solution to a given problem and that it will be stable to parameter variations? 
How can we determine if it will converge in finite time, or scale reasonably with the 
number of training patterns, the number of input features or with the perplexity of 
the problem? How can we ensure that the learning algorithm appropriately favors 
“simple” solutions (as in Fig. 1.6) rather than complicated ones (as in Fig. 1.5)? 


1.4.2 Unsupervised Learning 


In unsupervised learning or clustering there is no explicit teacher, and the system forms 
clusters or “natural groupings” of the input patterns. “Natural” is always defined 
explicitly or implicitly in the clustering system itself, and given a particular set of 
patterns or cost function, different clustering algorithms lead to different clusters. 
Often the user will set the hypothesized number of different clusters ahead of time, 
but how should this be done? How do we avoid inappropriate representations? 
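
The dependence of the result on the hypothesized number of clusters can be seen in a small sketch. We use k-means here purely as an assumed example of a clustering procedure (the text above does not single out any algorithm), and the feature values and initial centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=20):
    """Plain k-means in one dimension; `centers` is the initial guess."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical one-dimensional feature values with three visible groups.
values = [0.8, 1.0, 1.2, 3.8, 4.0, 4.2, 8.8, 9.0, 9.2]

two_centers, _ = kmeans_1d(values, centers=[1.0, 9.0])
three_centers, _ = kmeans_1d(values, centers=[1.0, 4.0, 9.0])
print(two_centers)
print(three_centers)
```

With two clusters requested, the two lower groups are merged into one; with three, each group gets its own center. Neither answer is "wrong" by the algorithm's own criterion, which is why the choice of the number of clusters matters.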


1.4.3 Reinforcement Learning 


The most typical way to train a classifier is to present an input, compute its tentative 
category label, and use the known target category label to improve the classifier. For 
instance, in optical character recognition, the input might be an image of a character, 
the actual output of the classifier the category label “R,” and the desired output a “B.” 
In reinforcement learning or learning with a critic, no desired category signal is given; 
instead, the only teaching feedback is that the tentative category is right or wrong. 
This is analogous to a critic who merely states that something is right or wrong, but 
does not say specifically how it is wrong. (Thus only binary feedback is given to the 
classifier; reinforcement learning also describes the case where a single scalar signal, 
say some number between 0 and 1, is given by the teacher.) In pattern classification, 
it is most common that such reinforcement is binary — either the tentative decision 
is correct or it is not. (Of course, if our problem involves just two categories and 
equal costs for errors, then learning with a critic is equivalent to standard supervised 
learning.) How can the system learn which are important from such non-specific 
feedback? 


1.5 Conclusion 


At this point the reader may be overwhelmed by the number, complexity and mag- 
nitude of these sub-problems. Further, these sub-problems are rarely addressed in 
isolation and they are invariably interrelated. Thus for instance in seeking to reduce 
the complexity of our classifier, we might affect its ability to deal with invariance. We 
point out, though, that the good news is at least three-fold: 1) there is an “existence 
proof” that many of these problems can indeed be solved — as demonstrated by hu- 
mans and other biological systems, 2) mathematical theories solving some of these 
problems have in fact been discovered, and finally 3) there remain many fascinating 
unsolved problems providing opportunities for progress. 


Summary by Chapters 


The overall organization of this book is to address first those cases where a great deal 
of information about the models is known (such as the probability densities, category 
labels, ...) and to move, chapter by chapter, toward problems where the form of the 
distributions is unknown and even the category membership of training patterns is 
unknown. We begin in Chap. ?? (Bayes decision theory) by considering the ideal case 
in which the probability structure underlying the categories is known perfectly. While 
this sort of situation rarely occurs in practice, it permits us to determine the optimal 
(Bayes) classifier against which we can compare all other methods. Moreover in some 
problems it enables us to predict the error we will get when we generalize to novel 
patterns. In Chap. ?? (Maximum Likelihood and Bayesian Parameter Estimation) 
we address the case when the full probability structure underlying the categories 
is not known, but the general forms of their distributions are — i.e., the models. 
Thus the uncertainty about a probability distribution is represented by the values of 
some unknown parameters, and we seek to determine these parameters to attain the 
best categorization. In Chap. ?? (Nonparametric techniques) we move yet further 
from the Bayesian ideal, and assume that we have no prior parameterized knowledge 
about the underlying probability structure; in essence our classification will be based 
on information provided by training samples alone. Classic techniques such as the 
nearest-neighbor algorithm and potential functions play an important role here. 


We then in Chap. ?? (Linear Discriminant Functions) return somewhat toward 
the general approach of parameter estimation. We shall assume that the so-called 
“discriminant functions” are of a very particular form — viz., linear — in order to de- 
rive a class of incremental training rules. Next, in Chap. ?? (Nonlinear Discriminants 
and Neural Networks) we see how some of the ideas from such linear discriminants 
can be extended to a class of very powerful algorithms such as backpropagation and 
others for multilayer neural networks; these neural techniques have a range of use- 
ful properties that have made them a mainstay in contemporary pattern recognition 
research. In Chap. ?? (Stochastic Methods) we discuss simulated annealing by the 
Boltzmann learning algorithm and other stochastic methods. We explore the behavior 
of such algorithms with regard to the matter of local minima that can plague other 
neural methods. Chapter ?? (Non-metric Methods) moves beyond models that are 
statistical in nature to ones that can be best described by (logical) rules. Here we 
discuss tree-based algorithms such as CART (which can also be applied to statistical 
data) and syntactic methods, such as grammar-based ones, which rely on crisp 
rules. 


Chapter ?? (Theory of Learning) is both the most important chapter and the 
most difficult one in the book. Some of the results described there, such as the 
notion of capacity, degrees of freedom, the relationship between expected error and 
training set size, and computational complexity are subtle but nevertheless crucial 
both theoretically and practically. In some sense, the other chapters can only be 
fully understood (or used) in light of the results presented here; you cannot expect to 
solve important pattern classification problems without using the material from this 
chapter. 


We conclude in Chap. ?? (Unsupervised Learning and Clustering), by addressing 
the case when input training patterns are not labeled, and our recognizer must 
determine the cluster structure. We also treat a related problem, that of learning 
with a critic, in which the teacher provides only a single bit of information during 
the presentation of a training pattern — “yes,” that the classification provided by the 
recognizer is correct, or “no,” it isn’t. Here algorithms for reinforcement learning will 
be presented. 


Bibliographical and Historical Remarks 


Classification is among the first crucial steps in making sense of the blooming buzzing 
confusion of sensory data that intelligent systems confront. In the western world, 
the foundations of pattern recognition can be traced to Plato [2], later extended by 
Aristotle [1], who distinguished between an “essential property” (which would be 
shared by all members in a class or “natural kind” as he put it) from an “accidental 
property” (which could differ among members in the class). Pattern recognition can 
be cast as the problem of finding such essential properties of a category. It has been a 
central theme in the discipline of philosophical epistemology, the study of the nature 
of knowledge. A more modern treatment of some philosophical problems of pattern 
recognition, relating to the technical matter in the current book, can be found in 
[22, 4, 18]. In the eastern world, the first Zen patriarch, Bodhidharma, would point 
at things and demand students to answer “What is that?” as a way of confronting the 
deepest issues in mind, the identity of objects, and the nature of classification and 
decision. A delightful and particularly insightful book on the foundations of artificial 
intelligence, including pattern recognition, is [9]. 

Early technical treatments by Minsky [14] and Rosenfeld [16] are still valuable, as 
are a number of overviews and reference books [5]. The modern literature on decision 
theory and pattern recognition is now overwhelming, and comprises dozens of journals, 
thousands of books and conference proceedings and innumerable articles; it continues 
to grow rapidly. While some disciplines such as statistics [7], machine learning [17] 
and neural networks [8], expand the foundations of pattern recognition, others, such 
as computer vision [6, 19] and speech recognition [15] rely on it heavily. Perceptual 
Psychology, Cognitive Science [12], Psychobiology [21] and Neuroscience [10] analyze 
how pattern recognition is achieved in humans and other animals. The extreme view 
that everything in human cognition — including rule-following and logic — can be 
reduced to pattern recognition is presented in [13]. Pattern recognition techniques 
have been applied in virtually every scientific and technical discipline. 


Chapter 2 


Bayesian decision theory 


2.1 Introduction 


Bayesian decision theory is a fundamental statistical approach to the problem of 
pattern classification. This approach is based on quantifying the tradeoffs between 
various classification decisions using probability and the costs that accompany 
such decisions. It makes the assumption that the decision problem is posed in probabilistic 
terms, and that all of the relevant probability values are known. In this chapter 
we develop the fundamentals of this theory, and show how it can be viewed as being 
simply a formalization of common-sense procedures; in subsequent chapters we will 
consider the problems that arise when the probabilistic structure is not completely 
known. 

While we will give a quite general, abstract development of Bayesian decision 
theory in Sect. ??, we begin our discussion with a specific example. Let us reconsider 
the hypothetical problem posed in Chap. ?? of designing a classifier to separate two 
kinds of fish: sea bass and salmon. Suppose that an observer watching fish arrive 
along the conveyor belt finds it hard to predict what type will emerge next and that 
the sequence of types of fish appears to be random. In decision-theoretic terminology 
we would say that as each fish emerges nature is in one or the other of the two possible 
states: either the fish is a sea bass or the fish is a salmon. We let $\omega$ denote the state 
of nature, with $\omega = \omega_1$ for sea bass and $\omega = \omega_2$ for salmon. Because the state of 
nature is so unpredictable, we consider $\omega$ to be a variable that must be described 
probabilistically. 

If the catch produced as much sea bass as salmon, we would say that the next fish 
is equally likely to be sea bass or salmon. More generally, we assume that there is 
some a priori probability (or simply prior) $P(\omega_1)$ that the next fish is sea bass, and 
some prior probability $P(\omega_2)$ that it is salmon. If we assume there are no other types 
of fish relevant here, then $P(\omega_1)$ and $P(\omega_2)$ sum to one. These prior probabilities 
reflect our prior knowledge of how likely we are to get a sea bass or salmon before 
the fish actually appears. It might, for instance, depend upon the time of year or the 
choice of fishing area. 

Suppose for a moment that we were forced to make a decision about the type of 
fish that will appear next without being allowed to see it. For the moment, we shall 
assume that any incorrect classification entails the same cost or consequence, and that 
the only information we are allowed to use is the value of the prior probabilities. If a 
decision must be made with so little information, it seems logical to use the following 
decision rule: Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$. 

This rule makes sense if we are to judge just one fish, but if we are to judge many 
fish, using this rule repeatedly may seem a bit strange. After all, we would always 
make the same decision even though we know that both types of fish will appear. 
How well it works depends upon the values of the prior probabilities. If $P(\omega_1)$ is very 
much greater than $P(\omega_2)$, our decision in favor of $\omega_1$ will be right most of the time. 
If $P(\omega_1) = P(\omega_2)$, we have only a fifty-fifty chance of being right. In general, the 
probability of error is the smaller of $P(\omega_1)$ and $P(\omega_2)$, and we shall see later that 
under these conditions no other decision rule can yield a larger probability of being 
right. 
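
The claim that the prior-only rule errs with probability equal to the smaller prior can be checked by simulation. The following is a sketch under our own assumptions: the function names, the number of trials, and the example priors are hypothetical choices made for illustration.

```python
import random

def decide_from_priors(p1, p2):
    """Prior-only rule: return 1 if P(w1) > P(w2), otherwise 2."""
    return 1 if p1 > p2 else 2

def simulate_error(p1, trials=100_000, seed=0):
    """Estimate the error rate of the prior-only rule when P(w1) = p1."""
    rng = random.Random(seed)
    decision = decide_from_priors(p1, 1.0 - p1)   # the same decision for every fish
    errors = 0
    for _ in range(trials):
        truth = 1 if rng.random() < p1 else 2     # nature draws the true state
        errors += (decision != truth)
    return errors / trials

for p1 in (0.5, 0.7, 0.9):
    print(p1, simulate_error(p1), min(p1, 1.0 - p1))
```

For each prior, the simulated error rate should lie close to the smaller of the two priors printed beside it.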

In most circumstances we are not asked to make decisions with so little information. 
In our example, we might for instance use a lightness measurement $x$ to improve 
our classifier. Different fish will yield different lightness readings and we express this 
variability in probabilistic terms; we consider $x$ to be a continuous random variable 
whose distribution depends on the state of nature, and is expressed as $p(x|\omega_1)$.* This 
is the class-conditional probability density function. (Strictly speaking, the probability 
density function $p(x|\omega_1)$ should be written as $p_X(x|\omega_1)$ to indicate that we are 
speaking about a particular density function for the random variable $X$. This more 
elaborate subscripted notation makes it clear that $p_X(\cdot)$ and $p_Y(\cdot)$ denote two different 
functions, a fact that is obscured when writing $p(x)$ and $p(y)$. Since this potential 
confusion rarely arises in practice, we have elected to adopt the simpler notation. 
Readers who are unsure of our notation or who would like to review probability theory 
should see Appendix ??.) This is the probability density function for $x$ given that 
the state of nature is $\omega_1$. (It is also sometimes called state-conditional probability 
density.) Then the difference between $p(x|\omega_1)$ and $p(x|\omega_2)$ describes the difference in 
lightness between populations of sea bass and salmon (Fig. 2.1). 

Suppose that we know both the prior probabilities $P(\omega_j)$ and the conditional 
densities $p(x|\omega_j)$. Suppose further that we measure the lightness of a fish and discover 
that its value is $x$. How does this measurement influence our attitude concerning the 
true state of nature, that is, the category of the fish? We note first that the (joint) 
probability density of finding a pattern that is in category $\omega_j$ and has feature value $x$ 
can be written two ways: $p(\omega_j, x) = P(\omega_j|x)p(x) = p(x|\omega_j)P(\omega_j)$. Rearranging these 
leads us to the answer to our question, which is called Bayes’ formula: 


$$P(\omega_j|x) = \frac{p(x|\omega_j)P(\omega_j)}{p(x)}, \tag{1}$$


where in this case of two categories 


$$p(x) = \sum_{j=1}^{2} p(x|\omega_j)P(\omega_j). \tag{2}$$


Bayes’ formula can be expressed informally in English by saying that 


$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}. \tag{3}$$

* We generally use an upper-case $P(\cdot)$ to denote a probability mass function and a lower-case $p(\cdot)$ 
to denote a probability density function. 

Bayes’ formula shows that by observing the value of $x$ we can convert the prior 
probability $P(\omega_j)$ to the a posteriori probability (or posterior) $P(\omega_j|x)$, 
the probability of the state of nature being $\omega_j$ given that feature value $x$ has 
been measured. We call $p(x|\omega_j)$ the likelihood of $\omega_j$ with respect to $x$ (a term 
chosen to indicate that, other things being equal, the category $\omega_j$ for which $p(x|\omega_j)$ 
is large is more “likely” to be the true category). Notice that it is the product of the 
likelihood and the prior probability that is most important in determining the posterior 
probability; the evidence factor, $p(x)$, can be viewed as merely a scale factor that 
guarantees that the posterior probabilities sum to one, as all good probabilities must. 
The variation of $P(\omega_j|x)$ with $x$ is illustrated in Fig. 2.2 for the case $P(\omega_1) = 2/3$ and 
$P(\omega_2) = 1/3$. 
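
Bayes’ formula is easy to evaluate numerically once the densities are specified. The sketch below uses the priors 2/3 and 1/3 mentioned above, but the class-conditional densities are an assumption of ours: we take them to be Gaussians with hypothetical means 11 and 13 and unit standard deviation, which are not the curves of Fig. 2.1.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

PRIORS = [2.0 / 3.0, 1.0 / 3.0]        # P(w1), P(w2)
PARAMS = [(11.0, 1.0), (13.0, 1.0)]    # hypothetical (mean, std) for p(x|w1), p(x|w2)

def posteriors(x):
    """Return [P(w1|x), P(w2|x)] from likelihood x prior / evidence."""
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(PARAMS, PRIORS)]
    evidence = sum(joint)              # p(x), the scale factor in Bayes' formula
    return [j / evidence for j in joint]

for x in (10.0, 12.0, 14.0):
    print(x, posteriors(x))
```

At $x = 12$ the two assumed likelihoods are equal, so the measurement carries no information and the posteriors coincide with the priors; away from that point the likelihoods pull the posteriors toward one category or the other.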


[Figure 2.1: plot of $p(x|\omega_i)$ against $x$, for $x$ from 9 to 15.] 

Figure 2.1: Hypothetical class-conditional probability density functions show the 
probability density of measuring a particular feature value $x$ given the pattern is 
in category $\omega_i$. If $x$ represents the length of a fish, the two curves might describe 
the difference in length of populations of two types of fish. Density functions are 
normalized, and thus the area under each curve is 1.0. 


If we have an observation $x$ for which $P(\omega_1|x)$ is greater than $P(\omega_2|x)$, we would 
naturally be inclined to decide that the true state of nature is $\omega_1$. Similarly, if $P(\omega_2|x)$ 
is greater than $P(\omega_1|x)$, we would be inclined to choose $\omega_2$. To justify this decision 
procedure, let us calculate the probability of error whenever we make a decision. 
Whenever we observe a particular $x$, 


$$P(\text{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1. \end{cases} \tag{4}$$


[Figure 2.2: plot of $P(\omega_i|x)$ against $x$.] 

Figure 2.2: Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 
1/3$ for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, 
given that a pattern is measured to have feature value $x = 14$, the probability it is 
in category $\omega_2$ is roughly 0.08, and that it is in $\omega_1$ is 0.92. At every $x$, the posteriors 
sum to 1.0. 


Clearly, for a given $x$ we can minimize the probability of error by deciding $\omega_1$ if 
$P(\omega_1|x) > P(\omega_2|x)$ and $\omega_2$ otherwise. Of course, we may never observe exactly the 
same value of $x$ twice. Will this rule minimize the average probability of error? Yes, 
because the average probability of error is given by 


$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\,dx = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx \tag{5}$$


and if for every $x$ we ensure that $P(\text{error}|x)$ is as small as possible, then the integral 
must be as small as possible. Thus we have justified the following Bayes’ decision 
rule for minimizing the probability of error: 


Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise decide $\omega_2$, (6) 


and under this rule Eq. 4 becomes 


$$P(\text{error}|x) = \min\left[P(\omega_1|x), P(\omega_2|x)\right]. \tag{7}$$


This form of the decision rule emphasizes the role of the posterior probabilities. 
By using Eq. 1, we can instead express the rule in terms of the conditional and prior 
probabilities. First note that the evidence, $p(x)$, in Eq. 1 is unimportant as far as 
making a decision is concerned. It is basically just a scale factor that states how 
frequently we will actually measure a pattern with feature value $x$; its presence in 
Eq. 1 assures us that $P(\omega_1|x) + P(\omega_2|x) = 1$. By eliminating this scale factor, we 
obtain the following completely equivalent decision rule: 


Decide $\omega_1$ if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$; otherwise decide $\omega_2$. (8) 


Some additional insight can be obtained by considering a few special cases. If 
for some $x$ we have $p(x|\omega_1) = p(x|\omega_2)$, then that particular observation gives us no 
information about the state of nature; in this case, the decision hinges entirely on the 
prior probabilities. On the other hand, if $P(\omega_1) = P(\omega_2)$, then the states of nature are 
equally probable; in this case the decision is based entirely on the likelihoods $p(x|\omega_j)$. 
In general, both of these factors are important in making a decision, and the Bayes 
decision rule combines them to achieve the minimum probability of error. 
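
The decision rule of Eq. 8 and the average error of Eq. 5 can both be sketched in a few lines. As before, the Gaussian class-conditional densities (means 11 and 13, unit standard deviation), the integration range, and the step count are hypothetical choices of ours; only the priors 2/3 and 1/3 come from the example above.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

P1, P2 = 2.0 / 3.0, 1.0 / 3.0          # priors P(w1), P(w2)

def p_x_given_w1(x):
    return gaussian_pdf(x, 11.0, 1.0)  # hypothetical p(x|w1)

def p_x_given_w2(x):
    return gaussian_pdf(x, 13.0, 1.0)  # hypothetical p(x|w2)

def decide(x):
    """Eq. 8: decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    return 1 if p_x_given_w1(x) * P1 > p_x_given_w2(x) * P2 else 2

def bayes_error(lo=0.0, hi=24.0, steps=24_000):
    """Midpoint-rule estimate of Eq. 5 under the Bayes rule.

    P(error|x) p(x) = min[P(w1|x), P(w2|x)] p(x) = min[p(x|w1)P(w1), p(x|w2)P(w2)].
    """
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += min(p_x_given_w1(x) * P1, p_x_given_w2(x) * P2) * dx
    return total

print(decide(12.0), decide(13.0), bayes_error())
```

At $x = 12$ the assumed likelihoods are equal, so the decision falls to the priors and favors $\omega_1$, which is the first special case described above. The estimated error is also smaller than the error of the prior-only rule, $\min[P(\omega_1), P(\omega_2)] = 1/3$, as it must be when the measurement is informative.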


2.2 Bayesian Decision Theory: Continuous Features 


We shall now formalize the ideas just considered, and generalize them in four ways: 
• by allowing the use of more than one feature 
• by allowing more than two states of nature 
• by allowing actions other than merely deciding the state of nature 
• by introducing a loss function more general than the probability of error. 


These generalizations and their attendant notational complexities should not obscure 
the central points illustrated in our simple example. Allowing the use of more 
than one feature merely requires replacing the scalar $x$ by the feature vector $\mathbf{x}$, where 
$\mathbf{x}$ is in a $d$-dimensional Euclidean space $\mathbf{R}^d$, called the feature space. Allowing more 
than two states of nature provides us with a useful generalization for a small notational 
expense. Allowing actions other than classification primarily allows the possibility of 
rejection, i.e., of refusing to make a decision in close cases; this is a useful option if 
being indecisive is not too costly. Formally, the loss function states exactly how costly 
each action is, and is used to convert a probability determination into a decision. Cost 
functions let us treat situations in which some kinds of classification mistakes are more 
costly than others, although we often discuss the simplest case, where all errors are 
equally costly. With this as a preamble, let us begin the more formal treatment. 

Let $\omega_1, \ldots, \omega_c$ be the finite set of $c$ states of nature (“categories”) and $\alpha_1, \ldots, \alpha_a$ 
be the finite set of $a$ possible actions. The loss function $\lambda(\alpha_i|\omega_j)$ describes the loss 
incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$. Let the feature vector 
$\mathbf{x}$ be a $d$-component vector-valued random variable, and let $p(\mathbf{x}|\omega_j)$ be the 
state-conditional probability density function for $\mathbf{x}$, that is, the probability density function for 
$\mathbf{x}$ conditioned on $\omega_j$ being the true state of nature. As before, $P(\omega_j)$ describes the 
prior probability that nature is in state $\omega_j$. Then the posterior probability $P(\omega_j|\mathbf{x})$ 
can be computed from $p(\mathbf{x}|\omega_j)$ by Bayes’ formula: 


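$$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)P(\omega_j)}{p(\mathbf{x})}, \tag{9}$$

where the evidence is now

$$p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)P(\omega_j). \tag{10}$$
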
Suppose that we observe a particular $\mathbf{x}$ and that we contemplate taking action $\alpha_i$. If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i|\omega_j)$. Since $P(\omega_j|\mathbf{x})$ is the probability that the true state of nature is $\omega_j$, the expected loss associated with taking action $\alpha_i$ is merely

$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}). \tag{11}$$

In decision-theoretic terminology, an expected loss is called a risk, and $R(\alpha_i|\mathbf{x})$ is called the conditional risk. Whenever we encounter a particular observation $\mathbf{x}$, we can minimize our expected loss by selecting the action that minimizes the conditional risk. We shall now show that this Bayes decision procedure actually provides the optimal performance on an overall risk.

Stated formally, our problem is to find a decision rule against $P(\omega_j)$ that minimizes the overall risk. A general decision rule is a function $\alpha(\mathbf{x})$ that tells us which action to take for every possible observation. To be more specific, for every $\mathbf{x}$ the decision function $\alpha(\mathbf{x})$ assumes one of the $a$ values $\alpha_1, \ldots, \alpha_a$. The overall risk $R$ is the expected loss associated with a given decision rule. Since $R(\alpha_i|\mathbf{x})$ is the conditional risk associated with action $\alpha_i$, and since the decision rule specifies the action, the overall risk is given by

$$R = \int R(\alpha(\mathbf{x})|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, \tag{12}$$


where $d\mathbf{x}$ is our notation for a $d$-space volume element, and where the integral extends over the entire feature space. Clearly, if $\alpha(\mathbf{x})$ is chosen so that $R(\alpha(\mathbf{x})|\mathbf{x})$ is as small as possible for every $\mathbf{x}$, then the overall risk will be minimized. This justifies the following statement of the Bayes decision rule: To minimize the overall risk, compute the conditional risk

$$R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}) \tag{13}$$

for $i = 1, \ldots, a$ and select the action $\alpha_i$ for which $R(\alpha_i|\mathbf{x})$ is minimum.* The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, and is the best performance that can be achieved.
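
As a minimal sketch of the rule just stated, the following Python computes the conditional risks of Eq. 13 and selects the minimizing action. The loss matrix, the third "reject" action and the posterior values are hypothetical numbers chosen for illustration; they do not come from the text.

```python
def conditional_risks(loss, posteriors):
    """R(alpha_i | x) = sum_j loss[i][j] * P(omega_j | x), one value per action."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]

def bayes_action(loss, posteriors):
    """Return (index of the minimum-risk action, list of all conditional risks)."""
    risks = conditional_risks(loss, posteriors)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Hypothetical loss matrix: rows are actions, columns are states of nature.
loss = [
    [0.0, 1.0],    # alpha_1: decide omega_1
    [1.0, 0.0],    # alpha_2: decide omega_2
    [0.25, 0.25],  # alpha_3: reject (refuse to decide)
]

close_call, close_risks = bayes_action(loss, [0.6, 0.4])
clear_case, clear_risks = bayes_action(loss, [0.9, 0.1])
print(close_call, close_risks)
print(clear_case, clear_risks)
```

With these assumed numbers, rejection has the lowest conditional risk when the posteriors are close (0.6 versus 0.4), while deciding $\omega_1$ wins when the posterior is 0.9.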


2.2.1 Two-Category Classification 


Let us consider these results when applied to the special case of two-category classification problems. Here action $\alpha_1$ corresponds to deciding that the true state of nature is $\omega_1$, and action $\alpha_2$ corresponds to deciding that it is $\omega_2$. For notational simplicity, let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$ be the loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$. If we write out the conditional risk given by Eq. 13, we obtain

$$\begin{aligned} R(\alpha_1|\mathbf{x}) &= \lambda_{11}P(\omega_1|\mathbf{x}) + \lambda_{12}P(\omega_2|\mathbf{x}) \quad \text{and} \\ R(\alpha_2|\mathbf{x}) &= \lambda_{21}P(\omega_1|\mathbf{x}) + \lambda_{22}P(\omega_2|\mathbf{x}). \end{aligned} \tag{14}$$

There are a variety of ways of expressing the minimum-risk decision rule, each having its own minor advantages. The fundamental rule is to decide $\omega_1$ if $R(\alpha_1|\mathbf{x}) < R(\alpha_2|\mathbf{x})$. In terms of the posterior probabilities, we decide $\omega_1$ if

$$(\lambda_{21} - \lambda_{11})P(\omega_1|\mathbf{x}) > (\lambda_{12} - \lambda_{22})P(\omega_2|\mathbf{x}). \tag{15}$$


* Note that if more than one action minimizes $R(\alpha|\mathbf{x})$, it does not matter which of these actions is taken, and any convenient tie-breaking rule can be used.

Ordinarily, the loss incurred for making an error is greater than the loss incurred for being correct, and both of the factors $\lambda_{21} - \lambda_{11}$ and $\lambda_{12} - \lambda_{22}$ are positive. Thus in practice, our decision is generally determined by the more likely state of nature, although we must scale the posterior probabilities by the loss differences. By employing Bayes’ formula, we can replace the posterior probabilities by the prior probabilities and the conditional densities. This results in the equivalent rule, to decide $\omega_1$ if

$$(\lambda_{21} - \lambda_{11})\,p(\mathbf{x}|\omega_1)P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(\mathbf{x}|\omega_2)P(\omega_2), \tag{16}$$

and $\omega_2$ otherwise.
Another alternative, which follows at once under the reasonable assumption that $\lambda_{21} > \lambda_{11}$, is to decide $\omega_1$ if

$$\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}\, \frac{P(\omega_2)}{P(\omega_1)}. \tag{17}$$

This form of the decision rule focuses on the $\mathbf{x}$-dependence of the probability densities. We can consider $p(\mathbf{x}|\omega_j)$ a function of $\omega_j$ (i.e., the likelihood function), and then form the likelihood ratio $p(\mathbf{x}|\omega_1)/p(\mathbf{x}|\omega_2)$. Thus the Bayes decision rule can be interpreted as calling for deciding $\omega_1$ if the likelihood ratio exceeds a threshold value that is independent of the observation $\mathbf{x}$.
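
The likelihood-ratio form of Eq. 17 can be sketched in a few lines of Python. The one-dimensional Gaussian class-conditional densities (means 0 and 2, unit variance), the priors and the loss values below are assumptions made only for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def lr_threshold(l11, l12, l21, l22, p1, p2):
    """Right-hand side of Eq. 17; it does not depend on the observation x."""
    return (l12 - l22) / (l21 - l11) * (p2 / p1)

def decide(x, theta, mu1=0.0, mu2=2.0, sigma=1.0):
    """Return 1 (decide omega_1) if the likelihood ratio exceeds theta, else 2."""
    ratio = gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu2, sigma)
    return 1 if ratio > theta else 2

# Zero-one loss with equal priors gives a threshold of 1.
theta_zero_one = lr_threshold(0.0, 1.0, 1.0, 0.0, 0.5, 0.5)
# Doubling lambda_12 (deciding omega_1 when omega_2 is true) raises the threshold.
theta_costly = lr_threshold(0.0, 2.0, 1.0, 0.0, 0.5, 0.5)

print(decide(0.9, theta_zero_one), decide(0.9, theta_costly))
```

Under these assumptions the same observation $x = 0.9$ is assigned to $\omega_1$ with zero-one loss but to $\omega_2$ once the larger loss $\lambda_{12}$ raises the threshold, because the region in which $\omega_1$ is chosen shrinks.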


2.3 Minimum-Error-Rate Classification 


In classification problems, each state of nature is usually associated with a different one of the $c$ classes, and the action $\alpha_i$ is usually interpreted as the decision that the true state of nature is $\omega_i$. If action $\alpha_i$ is taken and the true state of nature is $\omega_j$, then the decision is correct if $i = j$, and in error if $i \neq j$. If errors are to be avoided, it is natural to seek a decision rule that minimizes the probability of error, i.e., the error rate.

The loss function of interest for this case is hence the so-called symmetrical or 
zero-one loss function, 


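$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c. \tag{18}$$

This loss function assigns no loss to a correct decision, and assigns a unit loss to any error; thus, all errors are equally costly.* The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is

$$\begin{aligned} R(\alpha_i|\mathbf{x}) &= \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)P(\omega_j|\mathbf{x}) \\ &= \sum_{j \neq i} P(\omega_j|\mathbf{x}) \\ &= 1 - P(\omega_i|\mathbf{x}) \end{aligned} \tag{19}$$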


* We note that other loss functions, such as quadratic and linear-difference, find greater use in 
regression tasks, where there is a natural ordering on the predictions and we can meaningfully 
penalize predictions that are “more wrong” than others. 


and $P(\omega_i|\mathbf{x})$ is the conditional probability that action $\alpha_i$ is correct. The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. Thus, to minimize the average probability of error, we should select the $i$ that maximizes the posterior probability $P(\omega_i|\mathbf{x})$. In other words, for minimum error rate:

$$\text{Decide } \omega_i \text{ if } P(\omega_i|\mathbf{x}) > P(\omega_j|\mathbf{x}) \text{ for all } j \neq i. \tag{20}$$


This is the same rule as in Eq. 6. 

We saw in Fig. 2.2 some class-conditional probability densities and the posterior probabilities; Fig. 2.3 shows the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ for the same case. In general, this ratio can range between zero and infinity. The threshold value $\theta_a$ marked is from the same prior probabilities but with zero-one loss function. Notice that this leads to the same decision boundaries as in Fig. 2.2, as it must. If we penalize mistakes in classifying $\omega_2$ patterns as $\omega_1$ more than the converse (i.e., $\lambda_{12} > \lambda_{21}$), then Eq. 17 leads to the larger threshold $\theta_b$ marked. Note that the range of $x$ values for which we classify a pattern as $\omega_1$ gets smaller, as it should.


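[Figure 2.3 plot: likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ versus $x$, with thresholds $\theta_a$ and $\theta_b$ and the regions $\mathcal{R}_1$ and $\mathcal{R}_2$ marked.]

Figure 2.3: The likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold $\theta_a$. If our loss function penalizes miscategorizing $\omega_2$ as $\omega_1$ patterns more than the converse (i.e., $\lambda_{12} > \lambda_{21}$), we get the larger threshold $\theta_b$, and hence $\mathcal{R}_1$ becomes smaller.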


2.3.1 *Minimax Criterion 


Sometimes we must design our classifier to perform well over a range of prior probabilities. For instance, in our fish categorization problem we can imagine that whereas the physical properties of lightness and width of each type of fish remain constant, the prior probabilities might vary widely and in an unpredictable way, or alternatively we want to use the classifier in a different plant where we do not know the prior probabilities. A reasonable approach is then to design our classifier so that the worst overall risk for any value of the priors is as small as possible; that is, minimize the maximum possible overall risk.

In order to understand this, we let $\mathcal{R}_1$ denote that (as yet unknown) region in feature space where the classifier decides $\omega_1$ and likewise for $\mathcal{R}_2$ and $\omega_2$, and then write our overall risk Eq. 12 in terms of conditional risks:

$$\begin{aligned} R = {} & \int_{\mathcal{R}_1} \left[ \lambda_{11}P(\omega_1)\,p(\mathbf{x}|\omega_1) + \lambda_{12}P(\omega_2)\,p(\mathbf{x}|\omega_2) \right] d\mathbf{x} \\ & + \int_{\mathcal{R}_2} \left[ \lambda_{21}P(\omega_1)\,p(\mathbf{x}|\omega_1) + \lambda_{22}P(\omega_2)\,p(\mathbf{x}|\omega_2) \right] d\mathbf{x}. \end{aligned} \tag{21}$$

We use the fact that $P(\omega_2) = 1 - P(\omega_1)$ and that $\int_{\mathcal{R}_1} p(\mathbf{x}|\omega_1)\, d\mathbf{x} = 1 - \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x}$ to rewrite the risk as:

$$\begin{aligned} R(P(\omega_1)) = {} & \underbrace{\lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x}}_{=\, R_{mm},\ \text{minimax risk}} \\ & + P(\omega_1) \underbrace{\left[ (\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x} - (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x} \right]}_{=\, 0\ \text{for minimax solution}}. \end{aligned} \tag{22}$$

This equation shows that once the decision boundary is set (i.e., $\mathcal{R}_1$ and $\mathcal{R}_2$ determined), the overall risk is linear in $P(\omega_1)$. If we can find a boundary such that the constant of proportionality is 0, then the risk is independent of priors. This is the minimax solution, and the minimax risk, $R_{mm}$, can be read from Eq. 22:

$$\begin{aligned} R_{mm} &= \lambda_{22} + (\lambda_{12} - \lambda_{22}) \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x} \\ &= \lambda_{11} + (\lambda_{21} - \lambda_{11}) \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x}. \end{aligned} \tag{23}$$

Figure 2.4 illustrates the approach. Briefly stated, we search for the prior for which the Bayes risk is maximum; the corresponding decision boundary gives the minimax solution. The value of the minimax risk, $R_{mm}$, is hence equal to the worst Bayes risk. In practice, finding the decision boundary for minimax risk may be difficult, particularly when distributions are complicated. Nevertheless, in some cases the boundary can be determined analytically (Problem 3).
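
When no closed form is at hand, the minimax boundary can be found numerically. The sketch below assumes two one-dimensional Gaussians with equal variance (so that every Bayes boundary is a single threshold $t$, with $\omega_1$ decided for $x < t$), losses $\lambda_{11} = \lambda_{22} = 0$, $\lambda_{21} = 1$ and $\lambda_{12} = 2$, and uses bisection to drive the bracketed term of Eq. 22 to zero. All numerical values are hypothetical.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

MU1, MU2, SIGMA = 0.0, 3.0, 1.0   # assumed class-conditional Gaussians
L12, L21 = 2.0, 1.0               # lambda_12 and lambda_21; lambda_11 = lambda_22 = 0

def error_integrals(t):
    """Return (integral of p(x|w1) over R2, integral of p(x|w2) over R1) for R1 = {x < t}."""
    a = 1.0 - phi((t - MU1) / SIGMA)
    b = phi((t - MU2) / SIGMA)
    return a, b

def overall_risk(t, p1):
    """Eq. 22 with lambda_11 = lambda_22 = 0, for prior P(omega_1) = p1."""
    a, b = error_integrals(t)
    return p1 * L21 * a + (1.0 - p1) * L12 * b

def minimax_threshold(lo=-10.0, hi=13.0, iters=200):
    """Bisection on L21*a - L12*b, which decreases monotonically in t."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        a, b = error_integrals(mid)
        if L21 * a - L12 * b > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_mm = minimax_threshold()
print(t_mm, overall_risk(t_mm, 0.1), overall_risk(t_mm, 0.9))
```

At the returned threshold the risk is the same for any prior, which is the defining property of the minimax solution described above.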

The minimax criterion finds greater use in game theory than it does in traditional pattern recognition. In game theory, you have a hostile opponent who can be expected to take an action maximally detrimental to you. Thus it makes great sense for you to take an action (e.g., make a classification) where your costs, due to your opponent’s subsequent actions, are minimized.


Figure 2.4: The curve at the bottom shows the minimum (Bayes) error as a function of prior probability $P(\omega_1)$ in a two-category classification problem of fixed distributions. For each value of the priors (e.g., $P(\omega_1) = 0.25$) there is a corresponding optimal decision boundary and associated Bayes error rate. For any (fixed) such boundary, if the priors are then changed, the probability of error will change as a linear function of $P(\omega_1)$ (shown by the dashed line). The maximum such error will occur at an extreme value of the prior, here at $P(\omega_1) = 1$. To minimize the maximum of such error, we should design our decision boundary for the maximum Bayes error (here $P(\omega_1) = 0.6$), and thus the error will not change as a function of prior, as shown by the solid red horizontal line.


2.3.2 *Neyman-Pearson Criterion 


In some problems, we may wish to minimize the overall risk subject to a constraint; for instance, we might wish to minimize the total risk subject to the constraint $\int R(\alpha_i|\mathbf{x})\, d\mathbf{x} < \text{constant}$ for some particular $i$. Such a constraint might arise when there is a fixed resource that accompanies one particular action $\alpha_i$, or when we must not misclassify patterns from a particular state of nature $\omega_i$ at more than some limited frequency. For instance, in our fish example, there might be some government regulation that we must not misclassify more than 1% of salmon as sea bass. We might then seek a decision that minimizes the chance of classifying a sea bass as a salmon subject to this condition.
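
The regulatory constraint in the fish example can be sketched as follows. The code assumes, purely for illustration, that lightness is Gaussian with unit variance for both species (mean 4 for salmon, mean 7 for sea bass) and that we decide "sea bass" when $x > t$; with equal variances a threshold on $x$ is equivalent to a threshold on the likelihood ratio. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Hypothetical class-conditional densities for the lightness feature.
salmon = NormalDist(mu=4.0, sigma=1.0)
sea_bass = NormalDist(mu=7.0, sigma=1.0)

MAX_SALMON_ERROR = 0.01   # at most 1% of salmon may be called sea bass

# Decide "sea bass" when x > t. The constraint P(x > t | salmon) = 0.01
# fixes t as the 99th percentile of the salmon density.
t = salmon.inv_cdf(1.0 - MAX_SALMON_ERROR)

salmon_as_bass = 1.0 - salmon.cdf(t)   # constrained error
bass_as_salmon = sea_bass.cdf(t)       # error that the constraint leaves us with

print(t, salmon_as_bass, bass_as_salmon)
```

Meeting the 1% constraint exactly gives the smallest achievable rate of sea bass called salmon under these assumptions; tightening the constraint further would push $t$ to the right and increase that rate.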


We generally satisfy such a Neyman-Pearson criterion by adjusting decision boundaries numerically. However, for Gaussian and some other distributions, Neyman-Pearson solutions can be found analytically (Problems 5 & 6). We shall have cause to mention Neyman-Pearson criteria again in Sect. 2.8.3 on operating characteristics.

2.4 Classifiers, Discriminant Functions and Decision Surfaces


2.4.1 The Multi-Category Case 
There are many different ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$. The classifier is said to assign a feature vector $\mathbf{x}$ to class $\omega_i$ if

$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq i. \tag{24}$$


Thus, the classifier is viewed as a network or machine that computes c discriminant 
functions and selects the category corresponding to the largest discriminant. A net- 
work representation of a classifier is illustrated in Fig. 2.5. 


[Figure 2.5 diagram labels: input, discriminant functions, action (e.g., classification).]

Figure 2.5: The functional structure of a general statistical pattern classifier which includes $d$ inputs and $c$ discriminant functions $g_i(\mathbf{x})$. A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident.


A Bayes classifier is easily and naturally represented in this way. For the general case with risks, we can let $g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$, since the maximum discriminant function will then correspond to the minimum conditional risk. For the minimum-error-rate case, we can simplify things further by taking $g_i(\mathbf{x}) = P(\omega_i|\mathbf{x})$, so that the maximum discriminant function corresponds to the maximum posterior probability.

Clearly, the choice of discriminant functions is not unique. We can always multiply all the discriminant functions by the same positive constant or shift them by the same additive constant without influencing the decision. More generally, if we replace every $g_i(\mathbf{x})$ by $f(g_i(\mathbf{x}))$, where $f(\cdot)$ is a monotonically increasing function, the resulting classification is unchanged. This observation can lead to significant analytical and computational simplifications. In particular, for minimum-error-rate classification, any of the following choices gives identical classification results, but some can be much simpler to understand or to compute than others:


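$$g_i(\mathbf{x}) = P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j)P(\omega_j)} \tag{25}$$

$$g_i(\mathbf{x}) = p(\mathbf{x}|\omega_i)P(\omega_i) \tag{26}$$

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i), \tag{27}$$

where ln denotes natural logarithm.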
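
A short check of this equivalence, as a sketch under assumed parameters: three hypothetical one-dimensional Gaussian classes are classified on a grid of points using each of the forms in Eqs. 25 to 27, and the resulting decisions are compared.

```python
import math

# Hypothetical classes: (mean, standard deviation, prior probability).
CLASSES = [(-2.0, 1.0, 0.5), (0.0, 0.5, 0.3), (3.0, 2.0, 0.2)]

def log_gaussian(x, mu, sigma):
    """Natural log of the univariate normal density."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def g_log(i, x):
    """Eq. 27: ln p(x|w_i) + ln P(w_i)."""
    mu, sigma, prior = CLASSES[i]
    return log_gaussian(x, mu, sigma) + math.log(prior)

def g_joint(i, x):
    """Eq. 26: p(x|w_i) P(w_i)."""
    return math.exp(g_log(i, x))

def g_posterior(i, x):
    """Eq. 25: posterior probability P(w_i|x)."""
    evidence = sum(g_joint(j, x) for j in range(len(CLASSES)))
    return g_joint(i, x) / evidence

def classify(x, g):
    """Index of the class with the largest discriminant value."""
    return max(range(len(CLASSES)), key=lambda i: g(i, x))

grid = [-6.03 + 0.1 * k for k in range(141)]
by_posterior = [classify(x, g_posterior) for x in grid]
by_joint = [classify(x, g_joint) for x in grid]
by_log = [classify(x, g_log) for x in grid]
print(by_posterior == by_joint == by_log)
```

The logarithmic form is usually the one to compute in practice, since it avoids multiplying very small density values.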

Even though the discriminant functions can be written in a variety of forms, the decision rules are equivalent. The effect of any decision rule is to divide the feature space into $c$ decision regions, $\mathcal{R}_1, \ldots, \mathcal{R}_c$. If $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$, then $\mathbf{x}$ is in $\mathcal{R}_i$, and the decision rule calls for us to assign $\mathbf{x}$ to $\omega_i$. The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions (Fig. 2.6).


[Figure 2.6 plot label: decision boundary.]

Figure 2.6: In this two-dimensional two-category classifier, the probability densities are Gaussian (with 1/e ellipses shown), the decision boundary consists of two hyperbolas, and thus the decision region $\mathcal{R}_2$ is not simply connected.


2.4.2 The Two-Category Case 


While the two-category case is just a special instance of the multicategory case, it has traditionally received separate treatment. Indeed, a classifier that places a pattern in one of only two categories has a special name: a dichotomizer.* Instead of using two discriminant functions $g_1$ and $g_2$ and assigning $\mathbf{x}$ to $\omega_1$ if $g_1 > g_2$, it is more common to define a single discriminant function

$$g(\mathbf{x}) \equiv g_1(\mathbf{x}) - g_2(\mathbf{x}), \tag{28}$$

and to use the following decision rule: Decide $\omega_1$ if $g(\mathbf{x}) > 0$; otherwise decide $\omega_2$. Thus, a dichotomizer can be viewed as a machine that computes a single discriminant function $g(\mathbf{x})$, and classifies $\mathbf{x}$ according to the algebraic sign of the result. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two (derived from Eqs. 25 & 27) are particularly convenient:

$$g(\mathbf{x}) = P(\omega_1|\mathbf{x}) - P(\omega_2|\mathbf{x}) \tag{29}$$

$$g(\mathbf{x}) = \ln \frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}. \tag{30}$$

2.5 The Normal Density 


The structure of a Bayes classifier is determined by the conditional densities $p(\mathbf{x}|\omega_i)$ as well as by the prior probabilities. Of the various density functions that have been investigated, none has received more attention than the multivariate normal or Gaussian density. To a large extent this attention is due to its analytical tractability. However, the multivariate normal density is also an appropriate model for an important situation, viz., the case where the feature vectors $\mathbf{x}$ for a given class $\omega_i$ are continuous-valued, randomly corrupted versions of a single typical or prototype vector $\boldsymbol{\mu}_i$. In this section we provide a brief exposition of the multivariate normal density, focusing on the properties of greatest interest for classification problems.

First, recall the definition of the expected value of a scalar function $f(x)$, defined for some density $p(x)$:

$$\mathcal{E}[f(x)] \equiv \int_{-\infty}^{\infty} f(x)\,p(x)\, dx. \tag{31}$$

If we have samples in a set $\mathcal{D}$ from a discrete distribution, we must sum over all samples as

$$\mathcal{E}[f(x)] = \sum_{x \in \mathcal{D}} f(x)P(x), \tag{32}$$

where $P(x)$ is the probability mass at $x$. We shall often have call to calculate expected values, by these and analogous equations defined in higher dimensions (see Appendix Secs. ??, ?? & ??).*


* A classifier for more than two categories is called a polychotomizer. 

* We will often use somewhat loose engineering terminology and refer to a single point as a “sample.” Statisticians, though, always refer to a sample as a collection of points, and discuss “a sample of size n.” When taken in context, there are rarely ambiguities in such usage.

2.5.1 Univariate Density


We begin with the continuous univariate normal or Gaussian density,

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right], \tag{33}$$

for which the expected value of $x$ (an average, here taken over the feature space) is

$$\mu \equiv \mathcal{E}[x] = \int_{-\infty}^{\infty} x\,p(x)\, dx, \tag{34}$$

and where the expected squared deviation or variance is

$$\sigma^2 \equiv \mathcal{E}[(x - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx. \tag{35}$$


The univariate normal density is completely specified by two parameters: its mean $\mu$ and variance $\sigma^2$. For simplicity, we often abbreviate Eq. 33 by writing $p(x) \sim N(\mu, \sigma^2)$ to say that $x$ is distributed normally with mean $\mu$ and variance $\sigma^2$. Samples from normal distributions tend to cluster about the mean, with a spread related to the standard deviation $\sigma$ (Fig. 2.7).

[Figure 2.7 plot: $p(x)$ versus $x$, with ticks at $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$ and 2.5% of the area in each tail.]

Figure 2.7: A univariate normal distribution has roughly 95% of its area in the range $|x - \mu| \leq 2\sigma$, as shown. The peak of the distribution has value $p(\mu) = 1/(\sqrt{2\pi}\,\sigma)$.
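
The two numerical claims in the caption of Fig. 2.7 are easy to verify. The sketch below evaluates Eq. 33 and the corresponding cumulative distribution (written with `math.erf`) for an arbitrarily chosen mean and standard deviation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Eq. 33: univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def normal_cdf(x, mu, sigma):
    """Cumulative distribution of N(mu, sigma^2), expressed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 1.0, 2.0   # arbitrary illustrative parameters

# Probability mass within two standard deviations of the mean.
mass_2sigma = normal_cdf(mu + 2 * sigma, mu, sigma) - normal_cdf(mu - 2 * sigma, mu, sigma)
# Height of the density at its peak.
peak = normal_pdf(mu, mu, sigma)

print(round(mass_2sigma, 4), peak)
```

The mass within $2\sigma$ does not depend on the particular $\mu$ and $\sigma$ chosen, whereas the peak height scales as $1/\sigma$.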


There is a deep relationship between the normal distribution and entropy. We 
shall consider entropy in greater detail in Chap. ??, but for now we merely state that 
the entropy of a distribution is given by 


$$H(p(x)) = -\int p(x) \ln p(x)\, dx, \tag{36}$$

and measured in nats. If a $\log_2$ is used instead, the unit is the bit. The entropy is a non-negative quantity that describes the fundamental uncertainty in the values of points selected randomly from a distribution. It can be shown that the normal distribution has the maximum entropy of all distributions having a given mean and variance (Problem 20). Moreover, as stated by the Central Limit Theorem, the aggregate effect of a large number of small, independent random disturbances will lead to a Gaussian distribution (Computer exercise ??). Because many patterns (from fish to handwritten characters to some speech sounds) can be viewed as some ideal or prototype pattern corrupted by a large number of random processes, the Gaussian is often a good model for the actual probability distribution.


2.5.2 Multivariate Density 


The general multivariate normal density in d dimensions is written as 


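$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right], \tag{37}$$

where $\mathbf{x}$ is a $d$-component column vector, $\boldsymbol{\mu}$ is the $d$-component mean vector, $\boldsymbol{\Sigma}$ is the $d$-by-$d$ covariance matrix, $|\boldsymbol{\Sigma}|$ and $\boldsymbol{\Sigma}^{-1}$ are its determinant and inverse, respectively, and $(\mathbf{x} - \boldsymbol{\mu})^t$ is the transpose of $\mathbf{x} - \boldsymbol{\mu}$.* Our notation for the inner product is

$$\mathbf{a}^t \mathbf{b} = \sum_{i=1}^{d} a_i b_i, \tag{38}$$

and is often called a dot product.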
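For simplicity, we often abbreviate Eq. 37 as $p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Formally, we have

$$\boldsymbol{\mu} \equiv \mathcal{E}[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\, d\mathbf{x} \tag{39}$$

and

$$\boldsymbol{\Sigma} \equiv \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t p(\mathbf{x})\, d\mathbf{x}, \tag{40}$$

where the expected value of a vector or a matrix is found by taking the expected values of its components. In other words, if $x_i$ is the $i$th component of $\mathbf{x}$, $\mu_i$ the $i$th component of $\boldsymbol{\mu}$, and $\sigma_{ij}$ the $ij$th component of $\boldsymbol{\Sigma}$, then

$$\mu_i = \mathcal{E}[x_i] \tag{41}$$

and

$$\sigma_{ij} = \mathcal{E}[(x_i - \mu_i)(x_j - \mu_j)]. \tag{42}$$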

The covariance matrix $\boldsymbol{\Sigma}$ is always symmetric and positive semidefinite. We shall restrict our attention to the case in which $\boldsymbol{\Sigma}$ is positive definite, so that the determinant of $\boldsymbol{\Sigma}$ is strictly positive.† The diagonal elements $\sigma_{ii}$ are the variances of the respective $x_i$ (i.e., $\sigma_i^2$), and the off-diagonal elements $\sigma_{ij}$ are the covariances of $x_i$ and $x_j$. We would expect a positive covariance for the length and weight features of a population of fish, for instance. If $x_i$ and $x_j$ are statistically independent, $\sigma_{ij} = 0$. If all the off-diagonal elements are zero, $p(\mathbf{x})$ reduces to the product of the univariate normal densities for the components of $\mathbf{x}$.

* The mathematical expressions for the multivariate normal density are greatly simplified by employing the concepts and notation of linear algebra. Readers who are unsure of our notation or who would like to review linear algebra should see Appendix ??.

† If sample vectors are drawn from a linear subspace, $|\boldsymbol{\Sigma}| = 0$ and $p(\mathbf{x})$ is degenerate. This occurs, for example, when one component of $\mathbf{x}$ has zero variance, or when two components are identical or multiples of one another.

Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed. In particular, if $\mathbf{A}$ is a $d$-by-$k$ matrix and $\mathbf{y} = \mathbf{A}^t \mathbf{x}$ is a $k$-component vector, then $p(\mathbf{y}) \sim N(\mathbf{A}^t \boldsymbol{\mu}, \mathbf{A}^t \boldsymbol{\Sigma} \mathbf{A})$, as illustrated in Fig. 2.8. In the special case where $k = 1$ and $\mathbf{A}$ is a unit-length vector $\mathbf{a}$, $y = \mathbf{a}^t \mathbf{x}$ is a scalar that represents the projection of $\mathbf{x}$ onto a line in the direction of $\mathbf{a}$; in that case $\mathbf{a}^t \boldsymbol{\Sigma} \mathbf{a}$ is the variance of the projection of $\mathbf{x}$ onto $\mathbf{a}$. In general then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.

It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix $\mathbf{I}$. If we define $\boldsymbol{\Phi}$ to be the matrix whose columns are the orthonormal eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Lambda}$ the diagonal matrix of the corresponding eigenvalues, then the transformation $\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$ applied to the coordinates ensures that the transformed distribution has covariance matrix equal to the identity matrix. In signal processing, the transform $\mathbf{A}_w$ is called a whitening transformation, since it makes the eigenvalue spectrum of the transformed distribution uniform.
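
The whitening transformation can be checked directly in two dimensions. The following sketch uses only the standard library: it diagonalizes a symmetric 2-by-2 covariance matrix in closed form, builds $\mathbf{A}_w = \boldsymbol{\Phi} \boldsymbol{\Lambda}^{-1/2}$, and evaluates $\mathbf{A}_w^t \boldsymbol{\Sigma} \mathbf{A}_w$. The covariance values and helper names are hypothetical.

```python
import math

def eig_sym_2x2(cov):
    """Eigenvalues and orthonormal eigenvectors (as columns) of a symmetric 2x2 matrix."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # rotation angle that diagonalizes cov
    ct, st = math.cos(theta), math.sin(theta)
    phi = [[ct, -st], [st, ct]]
    lam = [a * ct * ct + 2.0 * b * ct * st + c * st * st,
           a * st * st - 2.0 * b * ct * st + c * ct * ct]
    return lam, phi

def matmul(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]

def whitening_matrix(cov):
    """A_w = Phi Lambda^(-1/2): scale each eigenvector column by 1/sqrt(eigenvalue)."""
    lam, phi = eig_sym_2x2(cov)
    return [[phi[i][j] / math.sqrt(lam[j]) for j in range(2)] for i in range(2)]

cov = [[4.0, 1.5], [1.5, 1.0]]   # assumed positive definite covariance
a_w = whitening_matrix(cov)
whitened_cov = matmul(transpose(a_w), matmul(cov, a_w))
print(whitened_cov)
```

Up to rounding error, the printed matrix is the identity, as the text asserts for the transformed distribution.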

The multivariate normal density is completely specified by $d + d(d + 1)/2$ parameters: the elements of the mean vector $\boldsymbol{\mu}$ and the independent elements of the covariance matrix $\boldsymbol{\Sigma}$. Samples drawn from a normal population tend to fall in a single cloud or cluster (Fig. 2.9); the center of the cluster is determined by the mean vector, and the shape of the cluster is determined by the covariance matrix. It follows from Eq. 37 that the loci of points of constant density are hyperellipsoids for which the quadratic form $(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is constant. The principal axes of these hyperellipsoids are given by the eigenvectors of $\boldsymbol{\Sigma}$ (described by $\boldsymbol{\Phi}$); the eigenvalues (described by $\boldsymbol{\Lambda}$) determine the lengths of these axes. The quantity

$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \tag{43}$$

is sometimes called the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$. Thus, the contours of constant density are hyperellipsoids of constant Mahalanobis distance to $\boldsymbol{\mu}$ and the volume of these hyperellipsoids measures the scatter of the samples about the mean. It can be shown (Problems 15 & 16) that the volume of the hyperellipsoid corresponding to a Mahalanobis distance $r$ is given by

$$V = V_d |\boldsymbol{\Sigma}|^{1/2} r^d, \tag{44}$$

where $V_d$ is the volume of a $d$-dimensional unit hypersphere:

$$V_d = \begin{cases} \pi^{d/2} / (d/2)! & d \text{ even} \\ 2^d \pi^{(d-1)/2} \left( \frac{d-1}{2} \right)! \,/\, d! & d \text{ odd.} \end{cases} \tag{45}$$

Thus, for a given dimensionality, the scatter of the samples varies directly with $|\boldsymbol{\Sigma}|^{1/2}$ (Problem 17).

Figure 2.8: The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, $\mathbf{A}$, takes the source distribution into distribution $N(\mathbf{A}^t \boldsymbol{\mu}, \mathbf{A}^t \boldsymbol{\Sigma} \mathbf{A})$. Another linear transformation (a projection $\mathbf{P}$ onto line $\mathbf{a}$) leads to $N(\mu, \sigma^2)$ measured along $\mathbf{a}$. While the transforms yield distributions in a different space, we show them superimposed on the original $x_1$-$x_2$ space. A whitening transform leads to a circularly symmetric Gaussian, here shown displaced.
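
The Mahalanobis distance of Eq. 43 and the volume formulas of Eqs. 44 and 45 can be sketched in plain Python. The two-dimensional covariance matrix and the function names below are illustrative assumptions.

```python
import math

def mahalanobis_sq_2d(x, mu, cov):
    """Eq. 43 in two dimensions: (x - mu)^t Sigma^{-1} (x - mu)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    return (dx0 * (inv[0][0] * dx0 + inv[0][1] * dx1)
            + dx1 * (inv[1][0] * dx0 + inv[1][1] * dx1))

def unit_hypersphere_volume(d):
    """Eq. 45: volume of the d-dimensional unit hypersphere."""
    if d % 2 == 0:
        return math.pi ** (d // 2) / math.factorial(d // 2)
    half = (d - 1) // 2
    return 2 ** d * math.pi ** half * math.factorial(half) / math.factorial(d)

def hyperellipsoid_volume(det_cov, r, d):
    """Eq. 44: V = V_d |Sigma|^{1/2} r^d."""
    return unit_hypersphere_volume(d) * math.sqrt(det_cov) * r ** d

cov = [[4.0, 0.0], [0.0, 1.0]]   # assumed covariance: variance 4 along x1, 1 along x2
r_sq = mahalanobis_sq_2d((2.0, 0.0), (0.0, 0.0), cov)
area = hyperellipsoid_volume(4.0, 1.0, 2)
print(r_sq, area)
```

A point two units from the mean along the high-variance axis is only one Mahalanobis unit away, and the $r = 1$ ellipse has area $2\pi$, twice that of the unit circle.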


2.6 Discriminant Functions for the Normal Density 


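In Sect. 2.4.1 we saw that the minimum-error-rate classification can be achieved by use of the discriminant functions

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i). \tag{46}$$

This expression can be readily evaluated if the densities $p(\mathbf{x}|\omega_i)$ are multivariate normal, i.e., if $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$. In this case, then, from Eq. 37 we have

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i). \tag{47}$$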


Let us examine the discriminant function and resulting classification for a number of 
special cases. 


Figure 2.9: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$. The red ellipses show lines of equal probability density of the Gaussian.


2.6.1 Case 1: $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$


The simplest case occurs when the features are statistically independent, and when each feature has the same variance, $\sigma^2$. In this case the covariance matrix is diagonal, being merely $\sigma^2$ times the identity matrix $\mathbf{I}$. Geometrically, this corresponds to the situation in which the samples fall in equal-size hyperspherical clusters, the cluster for the $i$th class being centered about the mean vector $\boldsymbol{\mu}_i$. The computation of the determinant and the inverse of $\boldsymbol{\Sigma}_i$ is particularly easy: $|\boldsymbol{\Sigma}_i| = \sigma^{2d}$ and $\boldsymbol{\Sigma}_i^{-1} = (1/\sigma^2)\mathbf{I}$. Since both $|\boldsymbol{\Sigma}_i|$ and the $(d/2) \ln 2\pi$ term in Eq. 47 are independent of $i$, they are unimportant additive constants that can be ignored. Thus we obtain the simple discriminant functions

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i), \tag{48}$$

where $\|\cdot\|$ is the Euclidean norm, that is,

$$\|\mathbf{x} - \boldsymbol{\mu}_i\|^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i). \tag{49}$$

If the prior probabilities are not equal, then Eq. 48 shows that the squared distance $\|\mathbf{x} - \boldsymbol{\mu}_i\|^2$ must be normalized by the variance $\sigma^2$ and offset by adding $\ln P(\omega_i)$; thus, if $\mathbf{x}$ is equally near two different mean vectors, the optimal decision will favor the a priori more likely category.

Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute distances. Expansion of the quadratic form $(\mathbf{x} - \boldsymbol{\mu}_i)^t (\mathbf{x} - \boldsymbol{\mu}_i)$ yields

$$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2} \left[ \mathbf{x}^t \mathbf{x} - 2\boldsymbol{\mu}_i^t \mathbf{x} + \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i \right] + \ln P(\omega_i), \tag{50}$$

which appears to be a quadratic function of $\mathbf{x}$. However, the quadratic term $\mathbf{x}^t \mathbf{x}$ is the same for all $i$, making it an ignorable additive constant. Thus, we obtain the equivalent linear discriminant functions

$$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \tag{51}$$

where

$$\mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i \tag{52}$$

and

$$w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^t \boldsymbol{\mu}_i + \ln P(\omega_i). \tag{53}$$

We call $w_{i0}$ the threshold or bias in the $i$th direction.

Figure 2.10: If the covariances of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d - 1$ dimensions, perpendicular to the line separating the means. In these 1-, 2-, and 3-dimensional examples, we indicate $p(\mathbf{x}|\omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2)$. In the 3-dimensional case, the grid plane separates $\mathcal{R}_1$ from $\mathcal{R}_2$.


A classifier that uses linear discriminant functions is called a linear machine. This kind of classifier has many interesting theoretical properties, some of which will be discussed in detail in Chap. ??. At this point we merely note that the decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for the two categories with the highest posterior probabilities. For our particular case, this equation can be written as

$$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0, \tag{54}$$

where

$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \tag{55}$$

and

$$\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j). \tag{56}$$

This equation defines a hyperplane through the point $\mathbf{x}_0$ and orthogonal to the vector $\mathbf{w}$. Since $\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, the hyperplane separating $\mathcal{R}_i$ and $\mathcal{R}_j$ is orthogonal to the line linking the means. If $P(\omega_i) = P(\omega_j)$, the second term on the right of Eq. 56 vanishes, and thus the point $\mathbf{x}_0$ is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means (Fig. 2.11). If $P(\omega_i) \neq P(\omega_j)$, the point $\mathbf{x}_0$ shifts away from the more likely mean. Note, however, that if the variance $\sigma^2$ is small relative to the squared distance $\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2$, then the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.

Figure 2.11: As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these 1-, 2- and 3-dimensional spherical Gaussian distributions.
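
The linear machine for this case is compact enough to write out. The sketch below implements Eqs. 51 to 53 and the boundary point of Eq. 56 for two classes; the means, the variance and the priors are hypothetical values chosen to show how unequal priors move the boundary.

```python
import math

def linear_discriminant(mu, sigma2, prior):
    """Eqs. 52 and 53: weight vector and bias for Sigma_i = sigma^2 I."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * sigma2) + math.log(prior)
    return w, w0

def classify(x, means, sigma2, priors):
    """Index of the class whose linear discriminant (Eq. 51) is largest."""
    scores = []
    for mu, prior in zip(means, priors):
        w, w0 = linear_discriminant(mu, sigma2, prior)
        scores.append(sum(wi * xi for wi, xi in zip(w, x)) + w0)
    return max(range(len(scores)), key=scores.__getitem__)

def boundary_point(mu_i, mu_j, sigma2, p_i, p_j):
    """Eq. 56: point x0 through which the separating hyperplane passes."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    scale = sigma2 / sum(d * d for d in diff) * math.log(p_i / p_j)
    return [0.5 * (a + b) - scale * d for a, b, d in zip(mu_i, mu_j, diff)]

means = [(0.0, 0.0), (4.0, 0.0)]   # assumed class means
x = (1.9, 0.0)                     # slightly closer to the first mean

equal = classify(x, means, 1.0, [0.5, 0.5])
skewed = classify(x, means, 1.0, [0.1, 0.9])
x0 = boundary_point(means[0], means[1], 1.0, 0.1, 0.9)
print(equal, skewed, x0)
```

With equal priors the point goes to the nearer mean (class index 0); with priors 0.1 and 0.9 the boundary point moves from $x_1 = 2$ to about $x_1 = 1.45$, away from the more likely mean, and the same point is assigned to class index 1.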


If the prior probabilities $P(\omega_i)$ are the same for all $c$ classes, then the $\ln P(\omega_i)$ term becomes another unimportant additive constant that can be ignored. When this happens, the optimum decision rule can be stated very simply: to classify a feature vector $\mathbf{x}$, measure the Euclidean distance $\|\mathbf{x} - \boldsymbol{\mu}_i\|$ from $\mathbf{x}$ to each of the $c$ mean vectors, and assign $\mathbf{x}$ to the category of the nearest mean. Such a classifier is called a minimum distance classifier. If each mean vector is thought of as being an ideal prototype or template for patterns in its class, then this is essentially a template-matching procedure (Fig. 2.10), a technique we will consider again in Chap. ?? Sect. ?? on the nearest-neighbor algorithm.

2.6.2 Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$


Another simple case arises when the covariance matrices for all of the classes are identical but otherwise arbitrary. Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster for the $i$th class being centered about the mean vector $\boldsymbol{\mu}_i$. Since both $|\boldsymbol{\Sigma}_i|$ and the $(d/2) \ln 2\pi$ term in Eq. 47 are independent of $i$, they can be ignored as superfluous additive constants. This simplification leads to the discriminant functions

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i). \tag{57}$$


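If the prior probabilities $P(\omega_i)$ are the same for all $c$ classes, then the $\ln P(\omega_i)$ term can be ignored. In this case, the optimal decision rule can once again be stated very simply: to classify a feature vector $\mathbf{x}$, measure the squared Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$ from $\mathbf{x}$ to each of the $c$ mean vectors, and assign $\mathbf{x}$ to the category of the nearest mean. As before, unequal prior probabilities bias the decision in favor of the a priori more likely category.

Expansion of the quadratic form $(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$ results in a sum involving a quadratic term $\mathbf{x}^t \boldsymbol{\Sigma}^{-1} \mathbf{x}$ which here is independent of $i$. After this term is dropped from Eq. 57, the resulting discriminant functions are again linear:

$$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \tag{58}$$

where

$$\mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i \tag{59}$$

and

$$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i). \tag{60}$$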

Since the discriminants are linear, the resulting decision boundaries are again 
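hyperplanes (Fig. 2.10). If $\mathcal{R}_i$ and $\mathcal{R}_j$ are contiguous, the boundary between them 
has the equation 

$$\mathbf{w}^t (\mathbf{x} - \mathbf{x}_0) = 0, \tag{61}$$

where 

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \tag{62}$$

and 

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln [P(\omega_i)/P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^t \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j). \tag{63}$$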


Since $\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$ is generally not in the direction of $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$, the hyperplane 
separating $\mathcal{R}_i$ and $\mathcal{R}_j$ is generally not orthogonal to the line between the means. 
However, it does intersect that line at the point $\mathbf{x}_0$ which is halfway between the 
means if the prior probabilities are equal. If the prior probabilities are not equal, the 
optimal boundary hyperplane is shifted away from the more likely mean (Fig. 2.12). 
As before, with sufficient bias the decision plane need not lie between the two mean 
vectors. 
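
The shared-covariance case can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the covariance matrix, means and priors below are invented for the example, and the helper names (`inv2`, `matvec`, `dot`, `g`, `boundary`) are hypothetical. It evaluates the discriminant of Eq. 57 and the hyperplane parameters of Eqs. 62 and 63 for two classes in two dimensions.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def g(x, mu, Sinv, prior):
    """Eq. 57: minus half the squared Mahalanobis distance plus the log prior."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return -0.5 * dot(diff, matvec(Sinv, diff)) + math.log(prior)

def boundary(mu_i, mu_j, Sinv, P_i, P_j):
    """Eqs. 62 and 63: normal vector w and point x0 of the separating hyperplane."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    w = matvec(Sinv, diff)
    shift = math.log(P_i / P_j) / dot(diff, w)
    x0 = [0.5 * (a + b) - shift * d for a, b, d in zip(mu_i, mu_j, diff)]
    return w, x0

# Illustrative (assumed) parameters, not taken from the text.
Sigma = [[2.0, 1.0], [1.0, 1.0]]
Sinv = inv2(Sigma)
mu1, mu2 = [0.0, 0.0], [3.0, 1.0]
P1, P2 = 0.7, 0.3

w, x0 = boundary(mu1, mu2, Sinv, P1, P2)
# On the boundary the two discriminants should agree.
gap = g(x0, mu1, Sinv, P1) - g(x0, mu2, Sinv, P2)
```

With these assumed numbers `w` is not parallel to the difference of the means, and `x0` lies on the side of the midpoint nearer the less likely mean, in line with the discussion above.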

Figure 2.12: Probability densities (indicated by the surfaces in two dimensions and 
ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric 
Gaussian distributions. The decision hyperplanes need not be perpendicular to the 
line connecting the means. 

2.6.3 Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary 


In the general multivariate normal case, the covariance matrices are different for each 
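category. The only term that can be dropped from Eq. 47 is the $(d/2) \ln 2\pi$ term, 
and the resulting discriminant functions are inherently quadratic: 

$$g_i(\mathbf{x}) = \mathbf{x}^t \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^t \mathbf{x} + w_{i0}, \tag{64}$$

where 

$$\mathbf{W}_i = -\frac{1}{2} \boldsymbol{\Sigma}_i^{-1}, \tag{65}$$

$$\mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i \tag{66}$$

and 

$$w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^t \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| + \ln P(\omega_i). \tag{67}$$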


The decision surfaces are hyperquadrics, and can assume any of the general forms 
— hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, 
and hyperhyperboloids of various types (Problem 29). Even in one dimension, for 
arbitrary covariance the decision regions need not be simply connected (Fig. 2.13). 
The two- and three-dimensional examples in Figs. 2.14 & 2.15 indicate how these 
different forms can arise. These variances are indicated by the contours of constant 
probability density. 

The extension of these results to more than two categories is straightforward 
though we need to keep clear which two of the total c categories are responsible for 
any boundary segment. Figure 2.16 shows the decision surfaces for a four-category 
case made up of Gaussian distributions. Of course, if the distributions are more com- 
plicated, the decision regions can be even more complex, though the same underlying 
theory holds there too. 


Figure 2.13: Non-simply connected decision regions can arise in one dimension for 
Gaussians having unequal variance. 




Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that 
are general hyperquadrics. Conversely, given any hyperquadric, one can find two 
Gaussian distributions whose Bayes decision boundary is that hyperquadric. 

Figure 2.15: Arbitrary three-dimensional Gaussian distributions yield Bayes decision 
boundaries that are two-dimensional hyperquadrics. There are even degenerate cases 
in which the decision boundary is a line. 

Figure 2.16: The decision regions for four normal distributions. Even with such a low 
number of categories, the shapes of the boundary regions can be rather complex. 

Example 1: Decision regions for two-dimensional Gaussian data 


To clarify these ideas, we explicitly calculate the decision boundary for the 
two-category two-dimensional data in the Example figure. Let $\omega_1$ be the set of the four 
black points, and $\omega_2$ the red points. Although we will spend much of the next chapter 
understanding how to estimate the parameters of our distributions, for now we simply 
assume that we need merely calculate the means and covariances by the discrete 
versions of Eqs. 39 & 40; they are found to be: 


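$$\boldsymbol{\mu}_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}; \quad \boldsymbol{\Sigma}_1 = \begin{pmatrix} 1/2 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{and} \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}; \quad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}.$$

The inverse matrices are then, 

$$\boldsymbol{\Sigma}_1^{-1} = \begin{pmatrix} 2 & 0 \\ 0 & 1/2 \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma}_2^{-1} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}.$$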


We assume equal prior probabilities, $P(\omega_1) = P(\omega_2) = 0.5$, and substitute these into 
the general form for a discriminant, Eqs. 64-67, setting $g_1(\mathbf{x}) = g_2(\mathbf{x})$ to obtain the 
decision boundary: 

$$x_2 = 3.514 - 1.125 x_1 + 0.1875 x_1^2.$$


This equation describes a parabola with vertex at $(3, 1.83)^t$. Note that despite the 
fact that the variance in the data along the $x_2$ direction for both distributions is the 
same, the decision boundary does not pass through the point $(3, 2)^t$, midway between 
the means, as we might have naively guessed. This is because for the $\omega_1$ distribution, 
the probability distribution is “squeezed” in the $x_1$-direction more so than for the $\omega_2$ 
distribution. Because the overall prior probabilities are the same (i.e., the integral over 
space of the probability density), the distribution is increased along the $x_2$ direction 
(relative to that for the $\omega_2$ distribution). Thus the decision boundary lies slightly 
lower than the point midway between the two means, as can be seen in the decision 
boundary. 
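
As a rough numerical check of this example, the sketch below evaluates the quadratic discriminants of Eqs. 64-67 for the stated means and (diagonal) covariances and confirms that they nearly coincide along the quoted parabola. The function names `g` and `boundary_x2` are hypothetical, and the diagonal shortcut is an assumption that happens to hold for these particular matrices.

```python
import math

# Parameters of Example 1 (diagonal covariances, equal priors).
mu1, var1 = (3.0, 6.0), (0.5, 2.0)
mu2, var2 = (3.0, -2.0), (2.0, 2.0)
prior = 0.5

def g(x, mu, var):
    """Eqs. 64-67 specialized to a diagonal covariance matrix."""
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    log_det = sum(math.log(v) for v in var)
    return -0.5 * quad - 0.5 * log_det + math.log(prior)

def boundary_x2(x1):
    """Closed-form boundary quoted in the example."""
    return 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2

# Difference g1 - g2 at several points on the parabola; it should be
# close to zero, up to the rounding of the quoted coefficients.
gaps = [g((x1, boundary_x2(x1)), mu1, var1) - g((x1, boundary_x2(x1)), mu2, var2)
        for x1 in (-2.0, 0.0, 3.0, 5.0, 8.0)]
vertex = (3.0, boundary_x2(3.0))
```

The small residual in `gaps` comes only from the four-digit rounding of the constant term; `vertex` evaluates the parabola at $x_1 = 3$.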


The computed Bayes decision boundary for two Gaussian distributions, each based 
on four data points. 

2.7 Error Probabilities and Integrals 


We can obtain additional insight into the operation of a general classifier — Bayes or 
otherwise — if we consider the sources of its error. Consider first the two-category 
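case, and suppose the dichotomizer has divided the space into two regions $\mathcal{R}_1$ and $\mathcal{R}_2$ 
in a possibly non-optimal way. There are two ways in which a classification error can 
occur; either an observation x falls in $\mathcal{R}_2$ and the true state of nature is $\omega_1$, or x falls 
in $\mathcal{R}_1$ and the true state of nature is $\omega_2$. Since these events are mutually exclusive 
and exhaustive, the probability of error is 

$$\begin{aligned}
P(error) &= P(\mathbf{x} \in \mathcal{R}_2, \omega_1) + P(\mathbf{x} \in \mathcal{R}_1, \omega_2) \\
&= P(\mathbf{x} \in \mathcal{R}_2|\omega_1)P(\omega_1) + P(\mathbf{x} \in \mathcal{R}_1|\omega_2)P(\omega_2) \\
&= \int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)P(\omega_1)\, d\mathbf{x} + \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)P(\omega_2)\, d\mathbf{x}.
\end{aligned} \tag{68}$$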


This result is illustrated in the one-dimensional case in Fig. 2.17. The two in- 
tegrals in Eq. 68 represent the pink and the gray areas in the tails of the functions 
$p(x|\omega_i)P(\omega_i)$. Because the decision point $x^*$ (and hence the regions $\mathcal{R}_1$ and $\mathcal{R}_2$) were 
chosen arbitrarily for that figure, the probability of error is not as small as it might 
be. In particular, the triangular area marked “reducible error” can be eliminated if 
the decision boundary is moved to $x_B$. This is the Bayes optimal decision boundary 
and gives the lowest probability of error. In general, if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$, 
it is advantageous to classify x as in $\mathcal{R}_1$ so that the smaller quantity will contribute 
to the error integral; this is exactly what the Bayes decision rule achieves. 

Figure 2.17: Components of the probability of error for equal priors and (non-optimal) 
decision point $x^*$. The pink area corresponds to the probability of errors for deciding 
$\omega_1$ when the state of nature is in fact $\omega_2$; the gray area represents the converse, as 
given in Eq. 68. If the decision boundary is instead at the point of equal posterior 
probabilities, $x_B$, then this reducible error is eliminated and the total shaded area is 
the minimum possible; this is the Bayes decision and gives the Bayes error rate. 
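
To make Eq. 68 concrete, here is a minimal one-dimensional sketch, assuming two normal densities with equal variance and a single threshold $x^*$ that splits the axis into $\mathcal{R}_1$ (below) and $\mathcal{R}_2$ (above). The parameter values and the names `Phi`, `p_error` and `x_B` are illustrative choices, not part of the text.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_error(x_star, mu1, mu2, sigma, P1, P2):
    """Eq. 68 in one dimension with R1 = (-inf, x*) and R2 = (x*, inf)."""
    w1_in_R2 = 1.0 - Phi((x_star - mu1) / sigma)  # x falls in R2 although state is w1
    w2_in_R1 = Phi((x_star - mu2) / sigma)        # x falls in R1 although state is w2
    return w1_in_R2 * P1 + w2_in_R1 * P2

# Assumed illustrative values: equal variances, mu1 < mu2, equal priors.
mu1, mu2, sigma, P1, P2 = 0.0, 2.0, 1.0, 0.5, 0.5

# With equal variances, p(x|w1)P(w1) = p(x|w2)P(w2) has a single solution x_B.
x_B = 0.5 * (mu1 + mu2) + sigma ** 2 * math.log(P1 / P2) / (mu2 - mu1)

bayes_error = p_error(x_B, mu1, mu2, sigma, P1, P2)
other_errors = [p_error(x, mu1, mu2, sigma, P1, P2) for x in (0.0, 0.5, 1.5, 2.5)]
```

Every threshold other than `x_B` in this sketch gives a larger error, which is the "reducible error" of the figure.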


In the multicategory case, there are more ways to be wrong than to be right, and 
it is simpler to compute the probability of being correct. Clearly 


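$$\begin{aligned}
P(correct) &= \sum_{i=1}^{c} P(\mathbf{x} \in \mathcal{R}_i, \omega_i) \\
&= \sum_{i=1}^{c} P(\mathbf{x} \in \mathcal{R}_i|\omega_i)P(\omega_i) \\
&= \sum_{i=1}^{c} \int_{\mathcal{R}_i} p(\mathbf{x}|\omega_i)P(\omega_i)\, d\mathbf{x}.
\end{aligned} \tag{69}$$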


The general result of Eq. 69 depends neither on how the feature space is partitioned 
into decision regions nor on the form of the underlying distributions. The Bayes 
classifier maximizes this probability by choosing the regions so that the integrand is 
maximal for all x; no other partitioning can yield a smaller probability of error. 


2.8 Error Bounds for Normal Densities 


The Bayes decision rule guarantees the lowest average error rate, and we have seen 
how to calculate the decision boundaries for normal densities. However, these results 
do not tell us what the probability of error actually is. The full calculation of the error 
for the Gaussian case would be quite difficult, especially in high dimensions, because 
of the discontinuous nature of the decision regions in the integral in Eq. 69. However, 
in the two-category case the general error integral of Eq. 5 can be approximated 
analytically to give us an upper bound on the error. 


2.8.1 Chernoff Bound 


To derive a bound for the error, we need the following inequality: 


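$$\min[a, b] \le a^{\beta} b^{1-\beta} \quad \text{for } a, b \ge 0 \text{ and } 0 \le \beta \le 1. \tag{70}$$

To understand this inequality we can, without loss of generality, assume $a \ge b$. Thus 
we need only show that $b \le a^{\beta} b^{1-\beta} = (a/b)^{\beta}\, b$. But this inequality is manifestly valid, 
since $(a/b)^{\beta} \ge 1$. Using Eqs. 7 & 1, we apply this inequality to Eq. 5 and get the bound: 

$$P(error) \le P^{\beta}(\omega_1) P^{1-\beta}(\omega_2) \int p^{\beta}(\mathbf{x}|\omega_1)\, p^{1-\beta}(\mathbf{x}|\omega_2)\, d\mathbf{x} \quad \text{for } 0 \le \beta \le 1. \tag{71}$$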


Note especially that this integral is over all feature space — we do not need to impose 
integration limits corresponding to decision boundaries. 

If the conditional probabilities are normal, the integral in Eq. 71 can be evaluated 
analytically (Problem 35), yielding: 


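$$\int p^{\beta}(\mathbf{x}|\omega_1)\, p^{1-\beta}(\mathbf{x}|\omega_2)\, d\mathbf{x} = e^{-k(\beta)}, \tag{72}$$

where 

$$k(\beta) = \frac{\beta(1-\beta)}{2} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^t [\beta \boldsymbol{\Sigma}_1 + (1-\beta) \boldsymbol{\Sigma}_2]^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \frac{1}{2} \ln \frac{|\beta \boldsymbol{\Sigma}_1 + (1-\beta) \boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|^{\beta} |\boldsymbol{\Sigma}_2|^{1-\beta}}. \tag{73}$$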


The graph in Fig. 2.18 shows a typical example of how $e^{-k(\beta)}$ varies with $\beta$. The 
Chernoff bound on P(error) is found by analytically or numerically finding the value 
of $\beta$ that minimizes $e^{-k(\beta)}$, and substituting the results in Eq. 71. The key benefit 
here is that this optimization is in the one-dimensional $\beta$ space, despite the fact that 
the distributions themselves might be in a space of arbitrarily high dimension. 

Figure 2.18: The Chernoff error bound is never looser than the Bhattacharyya bound. 
For this example, the Chernoff bound happens to be at $\beta^* = 0.66$, and is slightly 
tighter than the Bhattacharyya bound ($\beta = 0.5$). 
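
The one-dimensional search over $\beta$ can be sketched numerically. The example below is an assumption-laden illustration: it uses two scalar normal densities with made-up parameters, evaluates the right-hand side of Eq. 71 by direct midpoint-rule integration rather than through a closed form, and scans a coarse grid of $\beta$ values. The names `normal_pdf` and `chernoff_value` are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def chernoff_value(beta, P1, P2, mu1, s1, mu2, s2, lo=-30.0, hi=30.0, n=2000):
    """Right-hand side of Eq. 71, with the integral done by the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += normal_pdf(x, mu1, s1) ** beta * normal_pdf(x, mu2, s2) ** (1.0 - beta)
    return P1 ** beta * P2 ** (1.0 - beta) * total * h

# Assumed illustrative one-dimensional problem with unequal variances.
params = dict(P1=0.5, P2=0.5, mu1=0.0, s1=1.0, mu2=4.0, s2=3.0)
betas = [i / 20 for i in range(1, 20)]          # 0.05, 0.10, ..., 0.95
values = [chernoff_value(b, **params) for b in betas]

chernoff_bound = min(values)                     # best value on the grid
beta_star = betas[values.index(chernoff_bound)]
bhattacharyya_bound = chernoff_value(0.5, **params)
```

Because $\beta = 0.5$ is one of the grid points, the grid minimum can never exceed the Bhattacharyya value; a finer grid or a scalar minimizer would refine `beta_star`.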


2.8.2 Bhattacharyya Bound 


The general dependence of the Chernoff bound upon $\beta$ shown in Fig. 2.18 is typical 
of a wide range of problems: the bound is loose for extreme values (i.e., $\beta \to 1$ and 
$\beta \to 0$), and tighter for intermediate ones. While the precise value of the optimal 
$\beta$ depends upon the parameters of the distributions and the prior probabilities, a 
computationally simpler, but slightly less tight bound can be derived by merely using 
the results for $\beta = 1/2$. This result is the so-called Bhattacharyya bound on the error, 
where Eq. 71 then has the form 

$$\begin{aligned}
P(error) &\le \sqrt{P(\omega_1)P(\omega_2)} \int \sqrt{p(\mathbf{x}|\omega_1)\, p(\mathbf{x}|\omega_2)}\, d\mathbf{x} \\
&= \sqrt{P(\omega_1)P(\omega_2)}\; e^{-k(1/2)},
\end{aligned} \tag{74}$$


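where by Eq. 73 we have for the Gaussian case: 

$$k(1/2) = \frac{1}{8} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^t \left[\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right]^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \frac{1}{2} \ln \frac{\left|\frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2}\right|}{\sqrt{|\boldsymbol{\Sigma}_1| |\boldsymbol{\Sigma}_2|}}. \tag{75}$$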

The Chernoff and Bhattacharyya bounds may still be used even if the underlying 
distributions are not Gaussian. However, for distributions that deviate markedly from 
a Gaussian, the bounds will not be informative (Problem 32). 

Example 2: Error bounds for Gaussian distributions 


It is a straightforward matter to calculate the Bhattacharyya bound for the 
two-dimensional data sets of Example 1. Substituting the means and covariances of 
Example 1 into Eq. 75 we find k(1/2) = 4.11 and thus by Eqs. 74 & 75 the Bhattacharyya 
bound on the error is $P(error) \le 0.016382$. 

A tighter bound on the error can be approximated by searching numerically for the 
Chernoff bound of Eq. 73, which for this problem gives 0.016380. One can get the best 
estimate by numerically integrating the error rate directly via Eq. 5, which gives 0.0021, 
and thus the bounds here are not particularly tight. Such numerical integration is 
often impractical for Gaussians in higher than two or three dimensions. 
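
The arithmetic behind k(1/2) can be reproduced with a short sketch. It assumes the diagonal covariances of Example 1, so that determinants and inverses reduce to products and reciprocals of the variances; the variable names are hypothetical.

```python
import math

# Means and diagonal covariances from Example 1; equal priors.
mu1, var1 = (3.0, 6.0), (0.5, 2.0)
mu2, var2 = (3.0, -2.0), (2.0, 2.0)
P1 = P2 = 0.5

# Eq. 75 specialized to diagonal covariance matrices.
avg = [(a + b) / 2.0 for a, b in zip(var1, var2)]
quad = sum((m2 - m1) ** 2 / v for m1, m2, v in zip(mu1, mu2, avg))
log_term = math.log(math.prod(avg) / math.sqrt(math.prod(var1) * math.prod(var2)))
k_half = quad / 8.0 + 0.5 * log_term

exp_term = math.exp(-k_half)                  # the exponential factor alone
with_priors = math.sqrt(P1 * P2) * exp_term   # Eq. 74 including the prior factor
```

In this sketch `k_half` comes out near 4.11 and `exp_term` near 0.016382, matching the figures quoted in the example. Note that the quoted value corresponds to $e^{-k(1/2)}$ by itself; carrying the factor $\sqrt{P(\omega_1)P(\omega_2)} = 0.5$ of Eq. 74 through, as `with_priors` does, would give roughly half of that, so the stated inequality holds either way.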


2.8.3 Signal Detection Theory and Operating Characteristics 


Another measure of distance between two Gaussian distributions has found great 
use in experimental psychology, radar detection and other fields. Suppose we are 
interested in detecting a single weak pulse, such as a dim flash of light or a weak 
radar reflection. Our model is, then, that at some point in the detector there is an 
internal signal (such as a voltage) x, whose value has mean $\mu_2$ when the external signal 
(pulse) is present, and mean $\mu_1$ when it is not present. Because of random noise, 
within and outside the detector itself, the actual value is a random variable. We 
assume the distributions are normal with different means but the same variance, i.e., 
$p(x|\omega_i) \sim N(\mu_i, \sigma^2)$, as shown in Fig. 2.19. 

Figure 2.19: During any instant when no external pulse is present, the probability 
density for an internal signal is normal, i.e., $p(x|\omega_1) \sim N(\mu_1, \sigma^2)$; when the external 
signal is present, the density is $p(x|\omega_2) \sim N(\mu_2, \sigma^2)$. Any decision threshold $x^*$ will 
determine the probability of a hit (the red area under the $\omega_2$ curve, above $x^*$) and of 
a false alarm (the black area under the $\omega_1$ curve, above $x^*$). 


The detector (classifier) employs a threshold value x* for determining whether the 
external pulse is present, but suppose we, as experimenters, do not have access to this 
value (nor to the means and standard deviations of the distributions). We seek to 
find some measure of the ease of discriminating whether the pulse is present or not, in 
a form independent of the choice of x*. Such a measure is the discriminability, which 
describes the inherent and unchangeable properties due to noise and the strength of 
the external signal, and does not depend on the decision strategy (i.e., the actual 
choice of x*). This discriminability is defined as 


$$d' = \frac{|\mu_2 - \mu_1|}{\sigma}. \tag{76}$$


A high d’ is of course desirable. 

While we do not know $\mu_1$, $\mu_2$, $\sigma$ nor $x^*$, we assume here that we know the state 
of nature and the decision of the system. Such information allows us to find d’. To 
this end, we consider the following four probabilities: 


• $P(x > x^* | x \in \omega_2)$: a hit, the probability that the internal signal is above $x^*$ 
given that the external signal is present 

• $P(x > x^* | x \in \omega_1)$: a false alarm, the probability that the internal signal is 
above $x^*$ despite there being no external signal present 

• $P(x < x^* | x \in \omega_2)$: a miss, the probability that the internal signal is below $x^*$ 
given that the external signal is present 

• $P(x < x^* | x \in \omega_1)$: a correct rejection, the probability that the internal signal 
is below $x^*$ given that the external signal is not present. 


If we have a large number of trials (and we can assume $x^*$ is fixed, albeit at an 
unknown value), we can determine these probabilities experimentally, in particular 
the hit and false alarm rates. We plot a point representing these rates on a 
two-dimensional graph. If the densities are fixed but the threshold $x^*$ is changed, then our 
hit and false alarm rates will also change. Thus we see that for a given discriminability 
$d'$, our point will move along a smooth curve, a receiver operating characteristic or 
ROC curve (Fig. 2.20). 

Figure 2.20: In a receiver operating characteristic (ROC) curve, the abscissa is the 
probability of false alarm, $P(x > x^* | x \in \omega_1)$, and the ordinate the probability of hit, 
$P(x > x^* | x \in \omega_2)$. From the measured hit and false alarm rates (here corresponding 
to $x^*$ in Fig. 2.19, and shown as the red dot), we can deduce that $d' = 3$. 


The great benefit of this signal detection framework is that we can distinguish 
operationally between discriminability and decision bias — while the former is an 
inherent property of the detector system, the latter is due to the receiver’s implied 
but changeable loss matrix. Through any pair of hit and false alarm rates passes 
one and only one ROC curve; thus, so long as neither rate is exactly 0 or 1, we 
can determine the discriminability from these rates (Problem 38). Moreover, if the 
Gaussian assumption holds, a determination of the discriminability (from an arbitrary 
$x^*$) allows us to calculate the Bayes error rate, the most important property of any 
classifier. If the actual error rate differs from the Bayes rate inferred in this way, we 
should alter the threshold $x^*$ accordingly. 
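
Under the equal-variance Gaussian model, the step from a measured pair of rates to $d'$ can be sketched with the standard library's `statistics.NormalDist`. The rates here are simulated from assumed parameters rather than measured, $\mu_2 > \mu_1$ is assumed, and the function names are hypothetical. The Bayes error helper further assumes equal priors.

```python
from statistics import NormalDist

std_normal = NormalDist()  # zero mean, unit variance

def d_prime(hit_rate, false_alarm_rate):
    """Discriminability from one (false alarm, hit) point, assuming mu2 > mu1."""
    return std_normal.inv_cdf(hit_rate) - std_normal.inv_cdf(false_alarm_rate)

def bayes_error_equal_priors(d):
    """With equal priors the optimal threshold lies midway between the means."""
    return std_normal.cdf(-d / 2.0)

# Simulated "measurement": a detector with d' = 3 and an arbitrary threshold.
mu1, mu2, sigma, x_star = 0.0, 3.0, 1.0, 1.2
hit = 1.0 - NormalDist(mu2, sigma).cdf(x_star)
false_alarm = 1.0 - NormalDist(mu1, sigma).cdf(x_star)

estimate = d_prime(hit, false_alarm)
```

The estimate does not depend on the particular `x_star` chosen, which is the sense in which discriminability is separated from decision bias.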


It is a simple matter to generalize the above discussion and apply it to two cate- 
gories having arbitrary multidimensional distributions, Gaussian or not. Suppose we 
have two distributions $p(\mathbf{x}|\omega_1)$ and $p(\mathbf{x}|\omega_2)$ which overlap, and thus have non-zero 
Bayes classification error. Just as we saw above, a pattern actually from $\omega_2$ could 
be properly classified as $\omega_2$ (a “hit”), while a pattern actually from $\omega_1$ could be 
misclassified as $\omega_2$ (a “false alarm”). Unlike 
the one-dimensional case above, however, there may be many decision boundaries 
that give a particular hit rate, each with a different false alarm rate. Clearly here we 
cannot determine a fundamental measure of discriminability without knowing more 
about the underlying decision rule than just the hit and false alarm rates. 


In a rarely attainable ideal, we can imagine that our measured hit and false alarm 
rates are optimal, for example that of all the decision rules giving the measured hit 
rate, the rule that is actually used is the one having the minimum false alarm rate. 
If we constructed a multidimensional classifier — regardless of the distributions used 
— we might try to characterize the problem in this way, though it would probably 
require great computational resources to search for such optimal hit and false alarm 
rates. 


In practice, instead we eschew optimality, and simply vary a single parameter 
controlling the decision rule and plot the resulting hit and false alarm rates — a 
curve called merely an operating characteristic. Such a control parameter might be 
the bias or nonlinearity in a discriminant function. It is traditional to choose a 
control parameter that can yield, at extreme values, either a vanishing false alarm 
or a vanishing hit rate, just as can be achieved with a very large or a very small x* 
in an ROC curve. We should note that since the distributions can be arbitrary, the 
operating characteristic need not be symmetric (Fig. 2.21); in rare cases it need not 
even be concave down at all points. 

Figure 2.21: In a general operating characteristic curve, the abscissa is the probability 
of false alarm, $P(\mathbf{x} \in \mathcal{R}_2 | \mathbf{x} \in \omega_1)$, and the ordinate the probability of hit, 
$P(\mathbf{x} \in \mathcal{R}_2 | \mathbf{x} \in \omega_2)$. As illustrated here, operating characteristic curves are generally not 
symmetric, as shown at the right. 


Classifier operating curves are of value for problems where the loss matrix $\lambda_{ij}$ 
might be changed. If the operating characteristic has been determined as a function 
of the control parameter ahead of time, it is a simple matter, when faced with a new 
loss function, to deduce the control parameter setting that will minimize the expected 
risk (Problem 38). 
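
One way to picture this use of an operating characteristic is the following sketch. It assumes a scalar feature, two normal class-conditional densities with invented parameters, and a simple rule that decides $\omega_2$ when the feature exceeds a threshold (the control parameter). The curve is tabulated once; a new pair of losses then only requires a lookup. All names and values are illustrative.

```python
from statistics import NormalDist

# Assumed illustrative class-conditional densities for a scalar feature.
noise = NormalDist(0.0, 1.0)    # omega_1: no signal
signal = NormalDist(2.0, 1.5)   # omega_2: signal present

# Operating characteristic: sweep the threshold from -6.0 to 10.0.
thresholds = [-6.0 + 0.05 * k for k in range(321)]
curve = [(1.0 - noise.cdf(t), 1.0 - signal.cdf(t)) for t in thresholds]

def best_threshold(P1, P2, loss_false_alarm, loss_miss):
    """Pick the swept threshold with the smallest expected risk."""
    def risk(point):
        false_alarm, hit = point
        return loss_false_alarm * P1 * false_alarm + loss_miss * P2 * (1.0 - hit)
    risks = [risk(p) for p in curve]
    return thresholds[risks.index(min(risks))]

t_symmetric = best_threshold(0.5, 0.5, loss_false_alarm=1.0, loss_miss=1.0)
t_costly_miss = best_threshold(0.5, 0.5, loss_false_alarm=1.0, loss_miss=3.0)
```

Making a miss three times as costly as a false alarm pushes the selected threshold lower, trading more false alarms for fewer misses, without recomputing the curve.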

2.9 Bayes Decision Theory: Discrete Features 


Until now we have assumed that the feature vector x could be any point in a 
d-dimensional Euclidean space, $\mathbf{R}^d$. However, in many practical applications the 
components of x are binary-, ternary-, or higher integer valued, so that x can assume only 
one of m discrete values $\mathbf{v}_1, \ldots, \mathbf{v}_m$. In such cases, the probability density function 
$p(\mathbf{x}|\omega_j)$ becomes singular; integrals of the form 

$$\int p(\mathbf{x}|\omega_j)\, d\mathbf{x} \tag{77}$$

must then be replaced by corresponding sums, such as 

$$\sum_{\mathbf{x}} P(\mathbf{x}|\omega_j), \tag{78}$$


where we understand that the summation is over all values of x in the discrete 
distribution.* Bayes’ formula then involves probabilities, rather than probability 
densities: 

$$P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)P(\omega_j)}{P(\mathbf{x})}, \tag{79}$$

where 

$$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)P(\omega_j). \tag{80}$$

The definition of the conditional risk $R(\alpha|\mathbf{x})$ is unchanged, and the fundamental 
Bayes decision rule remains the same: To minimize the overall risk, select the action 
$\alpha_i$ for which $R(\alpha_i|\mathbf{x})$ is minimum, or stated formally, 

$$\alpha^* = \arg\min_i R(\alpha_i|\mathbf{x}). \tag{81}$$


The basic rule to minimize the error-rate by maximizing the posterior probability is 
also unchanged as are the discriminant functions of Eqs. 25-27, given the obvious 
replacement of densities p(·) by probabilities P(·). 


2.9.1 Independent Binary Features 


As an example of a classification involving discrete features, consider the two-category 
problem in which the components of the feature vector are binary-valued and 
conditionally independent. To be more specific we let $\mathbf{x} = (x_1, \ldots, x_d)^t$, where the 
components $x_i$ are either 0 or 1, with 

$$p_i = \text{Prob}(x_i = 1|\omega_1) \tag{82}$$

and 

$$q_i = \text{Prob}(x_i = 1|\omega_2). \tag{83}$$

* Technically speaking, Eq. 78 should be written as $\sum_k P(\mathbf{v}_k|\omega_j)$ where $P(\mathbf{v}_k|\omega_j)$ is the conditional 
probability that $\mathbf{x} = \mathbf{v}_k$ given that the state of nature is $\omega_j$. 

This is a model of a classification problem in which each feature gives us a yes/no 
answer about the pattern. If $p_i > q_i$, we expect the ith feature to give a “yes” answer 
more frequently when the state of nature is $\omega_1$ than when it is $\omega_2$. (As an 
example, consider two factories each making the same automobile, each of whose d 
components could be functional or defective. If it was known how the factories differed 
in their reliabilities for making each component, then this model could be used to judge 
which factory manufactured a given automobile based on the knowledge of which 
features are functional and which defective.) By assuming conditional independence 
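we can write $P(\mathbf{x}|\omega_i)$ as the product of the probabilities for the components of x. 
Given this assumption, a particularly convenient way of writing the class-conditional 
probabilities is as follows: 

$$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i} \tag{84}$$

and 

$$P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i}. \tag{85}$$

Then the likelihood ratio is given by 

$$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i} \tag{86}$$

and consequently Eq. 30 yields the discriminant function 

$$g(\mathbf{x}) = \sum_{i=1}^{d} \left[ x_i \ln \frac{p_i}{q_i} + (1 - x_i) \ln \frac{1 - p_i}{1 - q_i} \right] + \ln \frac{P(\omega_1)}{P(\omega_2)}. \tag{87}$$

We note especially that this discriminant function is linear in the $x_i$ and thus we can 
write 

$$g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0, \tag{88}$$

where 

$$w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)}, \quad i = 1, \ldots, d \tag{89}$$

and 

$$w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}. \tag{90}$$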


Let us examine these results to see what insight they can give. Recall first that 
we decide $\omega_1$ if $g(\mathbf{x}) > 0$ and $\omega_2$ if $g(\mathbf{x}) \le 0$. We have seen that g(x) is a weighted 
combination of the components of x. The magnitude of the weight $w_i$ indicates the 
relevance of a “yes” answer for $x_i$ in determining the classification. If $p_i = q_i$, $x_i$ gives 
us no information about the state of nature, and $w_i = 0$, just as we might expect. 
If $p_i > q_i$, then $1 - p_i < 1 - q_i$ and $w_i$ is positive. Thus in this case a “yes” answer 
for $x_i$ contributes $w_i$ votes for $\omega_1$. Furthermore, for any fixed $q_i < 1$, $w_i$ gets larger 
as $p_i$ gets larger. On the other hand, if $p_i < q_i$, $w_i$ is negative and a “yes” answer 
contributes $|w_i|$ votes for $\omega_2$. 

The condition of feature independence leads to a very simple (linear) classifier; 
of course if the features were not independent, a more complicated classifier would 
be needed. We shall come across this again for systems with continuous features in 
Chap. ??, but note here that the more independent we can make the features, the 
simpler the classifier can be. 

The prior probabilities $P(\omega_i)$ appear in the discriminant only through the 
threshold weight $w_0$. Increasing $P(\omega_1)$ increases $w_0$ and biases the decision in favor of $\omega_1$, 
whereas decreasing $P(\omega_1)$ has the opposite effect. Geometrically, the possible values 
for x appear as the vertices of a d-dimensional hypercube; the decision surface defined 
by g(x) = 0 is a hyperplane that separates $\omega_1$ vertices from $\omega_2$ vertices. 


Example 3: Bayesian decisions for three-dimensional binary features 

Suppose two categories consist of independent binary features in three dimensions 
with known feature probabilities. Let us construct the Bayesian decision boundary if 
$P(\omega_1) = P(\omega_2) = 0.5$ and the individual components obey: 

$$p_i = 0.8, \quad q_i = 0.5, \quad i = 1, 2, 3.$$

By Eqs. 89 & 90 we have that the weights are 

$$w_i = \ln \frac{0.8\,(1 - 0.5)}{0.5\,(1 - 0.8)} = \ln 4 \approx 1.3863$$

and the bias value is 

$$w_0 = \sum_{i=1}^{3} \ln \frac{1 - 0.8}{1 - 0.5} + \ln \frac{0.5}{0.5} = 3 \ln 0.4 \approx -2.75.$$

The decision boundary for the Example involving three-dimensional binary features. 
On the left we show the case $p_i = 0.8$ and $q_i = 0.5$. On the right we use the same values 
except $p_3 = q_3$, which leads to $w_3 = 0$ and a decision surface parallel to the $x_3$ axis. 

The surface g(x) = 0 from Eq. 88 is shown on the left of the figure. Indeed, as we 
might have expected, the boundary places points with two or more “yes” answers into 
category $\omega_1$, since that category has a higher probability of having any feature have 
value 1. 

Suppose instead that while the prior probabilities remained the same, our individ- 
ual components obeyed: 


$$p_1 = p_2 = 0.8, \quad p_3 = 0.5$$
$$q_1 = q_2 = q_3 = 0.5$$


In this case feature x3 gives us no predictive information about the categories, and 
hence the decision boundary is parallel to the x3 axis. Note that in this discrete case 
there is a large range in positions of the decision boundary that leaves the categoriza- 
tion unchanged, as is particularly clear in the figure on the right. 
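
Both cases of this example can be checked with a small sketch of Eqs. 88-90. The function names `weights` and `g` are hypothetical; the probabilities and priors are those stated in the example.

```python
import math
from itertools import product

def weights(p, q, P1, P2):
    """Eqs. 89 and 90 for conditionally independent binary features."""
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(P1 / P2)
    return w, w0

def g(x, w, w0):
    """Eq. 88: decide omega_1 when the value is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# First case of the example: p_i = 0.8, q_i = 0.5, equal priors.
w, w0 = weights([0.8] * 3, [0.5] * 3, 0.5, 0.5)
decisions = {x: (1 if g(x, w, w0) > 0 else 2) for x in product((0, 1), repeat=3)}

# Second case: the third feature is uninformative, so its weight vanishes.
w_b, w0_b = weights([0.8, 0.8, 0.5], [0.5] * 3, 0.5, 0.5)
```

Enumerating the eight vertices of the cube in `decisions` shows that exactly those with two or more "yes" answers are assigned to $\omega_1$ in the first case.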


2.10 Missing and Noisy Features 


If we know the full probability structure of a problem, we can construct the (optimal) 
Bayes decision rule. Suppose we develop a Bayes classifier using uncorrupted data, 
but our input (test) data are then corrupted in particular known ways. How can we 
classify such corrupted inputs to obtain a minimum error now? 

There are two analytically solvable cases of particular interest: when some of the 
features are missing, and when they are corrupted by a noise source with known 
properties. In each case our basic approach is to recover as much information about 
the underlying distribution as possible and use the Bayes decision rule. 


2.10.1 Missing Features 


Suppose we have a Bayesian (or other) recognizer for a problem using two features, 
but that for a particular pattern to be classified, one of the features is missing.* For 
example, we can easily imagine that the lightness can be measured from a portion of 
a fish, but the width cannot because of occlusion by another fish. 

We can illustrate with four categories a somewhat more general case (Fig. 2.22). 
Suppose for a particular test pattern the feature $x_1$ is missing, and the measured value 
of $x_2$ is $\hat{x}_2$. Clearly if we assume the missing value is the mean of all the $x_1$ values, 
i.e., $\bar{x}_1$, we will classify the pattern as $\omega_3$. However, if the priors are equal, $\omega_2$ would 
be a better decision, since the figure implies that $p(\hat{x}_2|\omega_2)$ is the largest of the four 
likelihoods. 

Figure 2.22: Four categories have equal priors and the class-conditional distributions 
shown. If a test point is presented in which one feature is missing (here, $x_1$) and the 
other is measured to have value $\hat{x}_2$ (red dashed line), we want our classifier to classify 
the pattern as category $\omega_2$, because $p(\hat{x}_2|\omega_2)$ is the largest of the four likelihoods. 

* In practice, just determining that the feature is in fact missing rather than having a value of zero 
(or the mean value) can be difficult in itself. 

To clarify our derivation we let $\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_b]$, where $\mathbf{x}_g$ represents the known or 
“good” features and $\mathbf{x}_b$ represents the “bad” ones, i.e., either unknown or missing. We 
seek the Bayes rule given the good features, and for that the posterior probabilities 
are needed. In terms of the good features the posteriors are 

$$\begin{aligned}
P(\omega_i|\mathbf{x}_g) &= \frac{p(\omega_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} = \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} \\
&= \frac{\int P(\omega_i|\mathbf{x}_g, \mathbf{x}_b)\, p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} \\
&= \frac{\int g_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}_b}{\int p(\mathbf{x})\, d\mathbf{x}_b},
\end{aligned} \tag{91}$$

where $g_i(\mathbf{x}) = g_i(\mathbf{x}_g, \mathbf{x}_b) = P(\omega_i|\mathbf{x}_g, \mathbf{x}_b)$ is one form of our discriminant function. 

We refer to $\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b$ as a marginal distribution; we say the full joint 
distribution is marginalized over the variable $\mathbf{x}_b$. In short, Eq. 91 shows that we must 
integrate (marginalize) the posterior probability over the bad features. Finally we 
use the Bayes decision rule on the resulting posterior probabilities, i.e., choose $\omega_i$ if 
$P(\omega_i|\mathbf{x}_g) > P(\omega_j|\mathbf{x}_g)$ for all $j \ne i$. We shall consider the Expectation-Maximization 
(EM) algorithm in Chap. ??, which addresses a related problem involving missing 
features. 
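
The contrast between marginalizing and naively filling in a mean value can be sketched as follows. The setup is an assumption made for illustration: three classes (rather than the four of Fig. 2.22) with independent, unit-variance normal features and invented means, and a midpoint-rule integral over the missing feature $x_1$. All names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Assumed illustrative problem: three classes, unit-variance independent features.
means = {1: (-4.0, 0.0), 2: (4.0, 0.5), 3: (0.0, 3.0)}
priors = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

def joint(i, x1, x2):
    """p(x | omega_i) P(omega_i) for a full feature vector."""
    m1, m2 = means[i]
    return normal_pdf(x1, m1) * normal_pdf(x2, m2) * priors[i]

def posterior_missing_x1(x2, lo=-15.0, hi=15.0, n=3000):
    """Eq. 91: integrate the joint over the missing feature x1 (midpoint rule)."""
    h = (hi - lo) / n
    grid = [lo + (k + 0.5) * h for k in range(n)]
    marg = {i: sum(joint(i, x1, x2) for x1 in grid) * h for i in means}
    total = sum(marg.values())
    return {i: m / total for i, m in marg.items()}

x2_observed = 0.5
post = posterior_missing_x1(x2_observed)
by_marginalizing = max(post, key=post.get)

# Naive alternative: fill in the overall mean of x1 and classify the full vector.
x1_mean = sum(priors[i] * means[i][0] for i in means)
by_imputing = max(means, key=lambda i: joint(i, x1_mean, x2_observed))
```

With these assumed means, imputing the overall mean of $x_1$ selects class 3, whereas marginalizing selects class 2, mirroring the situation described for Fig. 2.22.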


2.10.2 Noisy Features 


It is a simple matter to generalize the results of Eq. 91 to the case where a particular 
feature has been corrupted by statistically independent noise.* For instance, in our 
fish classification example, we might have a reliable measurement of the length, while 
variability of the light source might degrade the measurement of the lightness. We 
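assume we have uncorrupted (good) features $\mathbf{x}_g$, as before, and a noise model, 
expressed as $p(\mathbf{x}_b|\mathbf{x}_t)$. Here we let $\mathbf{x}_t$ denote the true value of the observed $\mathbf{x}_b$ features, 
i.e., without the noise present; that is, the $\mathbf{x}_b$ are observed instead of the true $\mathbf{x}_t$. We 
assume that if $\mathbf{x}_t$ were known, $\mathbf{x}_b$ would be independent of $\omega_i$ and $\mathbf{x}_g$. From such an 
assumption we get: 

$$P(\omega_i|\mathbf{x}_g, \mathbf{x}_b) = \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\, d\mathbf{x}_t}{p(\mathbf{x}_g, \mathbf{x}_b)}. \tag{92}$$

* Of course, to tell the classifier that a feature value is missing, the feature extractor must be designed 
to provide more than just a numerical value for each feature. 

Now $p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = P(\omega_i|\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)$, but by our independence 
assumption, if we know $\mathbf{x}_t$, then $\mathbf{x}_b$ does not provide any additional information about $\omega_i$. 
Thus we have $P(\omega_i|\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = P(\omega_i|\mathbf{x}_g, \mathbf{x}_t)$. Similarly, we have $p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = 
p(\mathbf{x}_b|\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_t)$, and $p(\mathbf{x}_b|\mathbf{x}_g, \mathbf{x}_t) = p(\mathbf{x}_b|\mathbf{x}_t)$. We put these together and thereby 
obtain 

$$\begin{aligned}
P(\omega_i|\mathbf{x}_g, \mathbf{x}_b) &= \frac{\int P(\omega_i|\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_b|\mathbf{x}_t)\, d\mathbf{x}_t}{\int p(\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_b|\mathbf{x}_t)\, d\mathbf{x}_t} \\
&= \frac{\int g_i(\mathbf{x})\, p(\mathbf{x})\, p(\mathbf{x}_b|\mathbf{x}_t)\, d\mathbf{x}_t}{\int p(\mathbf{x})\, p(\mathbf{x}_b|\mathbf{x}_t)\, d\mathbf{x}_t},
\end{aligned} \tag{93}$$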


which we use as discriminant functions for classification in the manner dictated by 
Bayes. 

Equation 93 differs from Eq. 91 solely by the fact that the integral is weighted 
by the noise model. In the extreme case where $p(\mathbf{x}_b|\mathbf{x}_t)$ is uniform over the entire 
space (and hence provides no predictive information for categorization), the equation 
reduces to the case of missing features, a satisfying result. 
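
A minimal numerical sketch of Eq. 93 follows, for the simplest setting one can assume: a single feature that is observed only through additive Gaussian noise, no good features, and two classes with invented parameters. The integral over the true value $x_t$ is approximated by the midpoint rule; all names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Assumed illustrative setup: one feature, two classes, additive Gaussian noise.
classes = {1: (0.0, 1.0), 2: (3.0, 1.0)}   # (mean, std) of the true value x_t
priors = {1: 0.5, 2: 0.5}
noise_std = 2.0                             # p(x_b | x_t) = N(x_t, noise_std^2)

def posterior_noisy(x_b, lo=-20.0, hi=20.0, n=4000):
    """Eq. 93 with no good features: weight each x_t by the noise model."""
    h = (hi - lo) / n
    grid = [lo + (k + 0.5) * h for k in range(n)]
    score = {}
    for i, (mu, sd) in classes.items():
        score[i] = priors[i] * h * sum(
            normal_pdf(x_t, mu, sd) * normal_pdf(x_b, x_t, noise_std) for x_t in grid)
    total = sum(score.values())
    return {i: s / total for i, s in score.items()}

def posterior_clean(x):
    """Posterior if x were observed without noise, for comparison."""
    score = {i: priors[i] * normal_pdf(x, mu, sd) for i, (mu, sd) in classes.items()}
    total = sum(score.values())
    return {i: s / total for i, s in score.items()}

noisy = posterior_noisy(0.0)
clean = posterior_clean(0.0)
```

For the same observed value, the noisy posterior is less extreme than the clean one, as one would expect when part of the observed spread is attributed to the noise source.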


2.11 Compound Bayesian Decision Theory and Context 


Let us reconsider our introductory example of designing a classifier to sort two types 
of fish. Our original assumption was that the sequence of types of fish was so unpre- 
dictable that the state of nature looked like a random variable. Without abandoning 
this attitude, let us consider the possibility that the consecutive states of nature might 
not be statistically independent. We should be able to exploit such statistical depen- 
dence to gain improved performance. This is one example of the use of context to aid 
decision making. 

The way in which we exploit such context information is somewhat different when 
we can wait for n fish to emerge and then make all n decisions jointly than when 
we must decide as each fish emerges. The first problem is a compound decision prob- 
lem, and the second is a sequential compound decision problem. The former case is 
conceptually simpler, and is the one we shall examine here. 

To state the general problem, let $\boldsymbol{\omega} = (\omega(1), \ldots, \omega(n))^t$ be a vector denoting the n
states of nature, with $\omega(i)$ taking on one of the c values $\omega_1, \ldots, \omega_c$. Let $P(\boldsymbol{\omega})$ be the
prior probability for the n states of nature. Let $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be a matrix giving
the n observed feature vectors, with $\mathbf{x}_i$ being the feature vector obtained when the
state of nature was $\omega(i)$. Finally, let $p(\mathbf{X}|\boldsymbol{\omega})$ be the conditional probability density
function for $\mathbf{X}$ given the true set of states of nature $\boldsymbol{\omega}$. Using this notation we see
that the posterior probability of $\boldsymbol{\omega}$ is given by


$$P(\boldsymbol{\omega}|\mathbf{X}) = \frac{p(\mathbf{X}|\boldsymbol{\omega}) P(\boldsymbol{\omega})}{p(\mathbf{X})} = \frac{p(\mathbf{X}|\boldsymbol{\omega}) P(\boldsymbol{\omega})}{\sum_{\boldsymbol{\omega}} p(\mathbf{X}|\boldsymbol{\omega}) P(\boldsymbol{\omega})}. \qquad (94)$$

In general, one can define a loss matrix for the compound decision problem and
seek a decision rule that minimizes the compound risk. The development of this
theory parallels our discussion for the simple decision problem, and concludes that
the optimal procedure is to minimize the compound conditional risk. In particular, if
there is no loss for being correct, and if all errors are equally costly, then the procedure
reduces to computing $P(\boldsymbol{\omega}|\mathbf{X})$ for all $\boldsymbol{\omega}$ and selecting the $\boldsymbol{\omega}$ for which this posterior
probability is maximum.

While this provides the theoretical solution, in practice the computation of $P(\boldsymbol{\omega}|\mathbf{X})$
can easily prove to be an enormous task. If each component $\omega(i)$ can have one of
c values, there are $c^n$ possible values of $\boldsymbol{\omega}$ to consider. Some simplification can be
obtained if the distribution of the feature vector $\mathbf{x}_i$ depends only on the corresponding
state of nature $\omega(i)$, not on the values of the other feature vectors or the other states of
nature. In this case the joint density $p(\mathbf{X}|\boldsymbol{\omega})$ is merely the product of the component
densities $p(\mathbf{x}_i|\omega(i))$:

$$p(\mathbf{X}|\boldsymbol{\omega}) = \prod_{i=1}^{n} p(\mathbf{x}_i|\omega(i)). \qquad (95)$$

While this simplifies the problem of computing $p(\mathbf{X}|\boldsymbol{\omega})$, there is still the problem
of computing the prior probabilities $P(\boldsymbol{\omega})$. This joint probability is central to the
compound Bayes decision problem, since it reflects the interdependence of the states
of nature. Thus it is unacceptable to simplify the problem of calculating $P(\boldsymbol{\omega})$ by
assuming that the states of nature are independent. In addition, practical applications
usually require some method of avoiding the computation of $P(\boldsymbol{\omega}|\mathbf{X})$ for all $c^n$ possible
values of $\boldsymbol{\omega}$. We shall find some solutions to this problem in Chap. ??.
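
The compound decision rule can be illustrated with a small brute-force sketch. It enumerates all $c^n$ sequences, uses the product likelihood of Eq. 95, and normalizes as in Eq. 94. Everything specific here is an assumption made for illustration only: the univariate Gaussian class-conditional densities, the first-order Markov form chosen for the prior $P(\boldsymbol{\omega})$, and the names `compound_map`, `initial` and `transition`.

```python
import itertools
import math


def gauss_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def compound_map(xs, class_params, initial, transition):
    """Return (best sequence, its posterior) by enumerating all c**n sequences.

    xs           : n scalar observations
    class_params : (mean, std) per state; Gaussian densities are an assumption
    initial      : initial[j] = P(omega(1) = j)
    transition   : transition[j][k] = P(omega(i+1) = k | omega(i) = j)
    """
    c = len(class_params)
    joint = {}
    for seq in itertools.product(range(c), repeat=len(xs)):
        # Prior P(omega) under the assumed first-order Markov model.
        prior = initial[seq[0]]
        for a, b in zip(seq, seq[1:]):
            prior *= transition[a][b]
        # Likelihood p(X | omega) as the product of Eq. 95.
        like = 1.0
        for x, s in zip(xs, seq):
            like *= gauss_pdf(x, *class_params[s])
        joint[seq] = like * prior
    evidence = sum(joint.values())  # denominator of Eq. 94
    best = max(joint, key=joint.get)
    return best, joint[best] / evidence


params = [(0.0, 1.0), (2.0, 1.0)]     # hypothetical (mean, std) for two states
initial = [0.5, 0.5]
sticky = [[0.9, 0.1], [0.1, 0.9]]     # consecutive states tend to repeat
xs = [0.1, 1.2, -0.3]                 # the middle value alone favors state 1

best_seq, posterior = compound_map(xs, params, initial, sticky)

# Deciding each observation on its own (equal priors) ignores the context.
independent = tuple(
    max(range(2), key=lambda j: gauss_pdf(x, *params[j])) for x in xs
)
print(best_seq, round(posterior, 3), independent)
```

With these assumed numbers the joint decision labels all three observations as state 0, while the independent decisions flip the middle one. The enumeration costs $c^n$ evaluations, which is exactly the burden noted above.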


Summary 


The basic ideas underlying Bayes decision theory are very simple. To minimize the
overall risk, one should always choose the action that minimizes the conditional risk
$R(\alpha|\mathbf{x})$. In particular, to minimize the probability of error in a classification problem,
one should always choose the state of nature that maximizes the posterior probability
$P(\omega_j|\mathbf{x})$. Bayes’ formula allows us to calculate such probabilities from the prior
probabilities $P(\omega_j)$ and the conditional densities $p(\mathbf{x}|\omega_j)$. If there are different penalties
for misclassifying patterns from $\omega_i$ as if from $\omega_j$, the posteriors must first be weighted
according to such penalties before taking action.

If the underlying distributions are multivariate Gaussian, the decision boundaries
will be hyperquadrics, whose form and position depend upon the prior probabilities,
means and covariances of the distributions in question. The true expected error
can be bounded above by the Chernoff and computationally simpler Bhattacharyya
bounds. If an input (test) pattern has missing or corrupted features, we should form
the marginal distributions by integrating over such features, and then use the Bayes
decision procedure on the resulting distributions. Receiver operating characteristic
curves describe the inherent and unchangeable properties of a classifier and can be
used, for example, to determine the Bayes rate.

For many pattern classification applications, the chief problem in applying these
results is that the conditional densities $p(\mathbf{x}|\omega_j)$ are not known. In some cases we may
know the form these densities assume, but may not know characterizing parameter
values. The classic case occurs when the densities are known to be, or can be assumed
to be, multivariate normal, but the values of the mean vectors and the covariance
matrices are not known. More commonly even less is known about the conditional 
densities, and procedures that are less sensitive to specific assumptions about the 
densities must be used. Most of the remainder of this book will be devoted to various 
procedures that have been developed to attack such problems. 
Bibliographical and Historical Remarks


The power, coherence and elegance of Bayesian theory in pattern recognition make 
it among the most beautiful formalisms in science. Its foundations go back to Bayes 
himself, of course [3], but he stated his theorem (Eq. 1) for the case of uniform 
priors. It was Laplace [25] who first stated it for the more general (but discrete) case. 
There are several modern and clear descriptions of the ideas — in pattern recognition 
and general decision theory — that can be recommended [7, 6, 26, 15, 13, 20, 27]. 
Since Bayesian theory rests on an axiomatic foundation, it is guaranteed to have 
quantitative coherence; some other classification methods do not. Wald presents a 
non-Bayesian perspective on these topics that can be highly recommended [36], and 
the philosophical foundations of Bayesian and non-Bayesian methods are explored in 
[16]. Neyman and Pearson provided some of the most important pioneering work 
in hypothesis testing, and used the probability of error as the criterion [28]; Wald 
extended this work by introducing the notions of loss and risk [35]. Certain conceptual 
problems have always attended the use of loss functions and prior probabilities. In 
fact, the Bayesian approach is avoided by many statisticians, partly because there are 
problems for which a decision is made only once, and partly because there may be no 
reasonable way to determine the prior probabilities. Neither of these difficulties seems 
to present a serious drawback in typical pattern recognition applications: for nearly
all critical pattern recognition problems we will have training data, and we will use our
recognizer more than once. For these reasons, the Bayesian approach will continue
to be of great use in pattern recognition. The single most important drawback of the
Bayesian approach is its assumption that the true probability distributions for the
problem can be represented by the classifier; for instance, that the true distributions are
Gaussian and that only the parameters describing these Gaussians are unknown. This
is a strong assumption that is not always fulfilled, and we shall later consider other
approaches that do not have this requirement.


Chow [10] was among the earliest to use Bayesian decision theory for pattern
recognition, and he later established fundamental relations between error and reject rate
[11]. Error rates for Gaussians have been explored by [18], and the Chernoff and
Bhattacharyya bounds were first presented in [9, 8], respectively, and are explored in
a number of statistics texts, such as [17]. Computational approximations for bounding
integrals for Bayesian probability of error (the source for one of the homework
problems) appear in [2]. Neyman and Pearson also worked on classification given
constraints [28], and the analysis of minimax estimators for multivariate normals is
presented in [5, 4, 14]. Signal detection theory and receiver operating characteristics
are fully explored in [21]; a brief overview, targeting experimental psychologists, is
[34]. Our discussion of the missing feature problem follows closely the work of [1], while
the definitive book on missing features, including a great deal beyond our discussion
here, can be found in [30].


Entropy was the central concept in the foundation of information theory [31], and
the relation of Gaussians to entropy is explored in [33]. Readers requiring a review of
information theory [12], linear algebra [24, 23], calculus and continuous mathematics
[38, 32], probability [29], or calculus of variations and Lagrange multipliers [19] should
consult these texts and those listed in our Appendix.
Problems


Section 2.1


1. In the two-category case, under the Bayes decision rule the conditional error
is given by Eq. 7. Even if the posterior densities are continuous, this form of the
conditional error virtually always leads to a discontinuous integrand when calculating
the full error by Eq. 5.

(a) Show that for arbitrary densities, we can replace Eq. 7 by $P(error|x) = 2P(\omega_1|x)P(\omega_2|x)$
in the integral and get an upper bound on the full error.

(b) Show that if we use $P(error|x) = \alpha P(\omega_1|x)P(\omega_2|x)$ for $\alpha < 2$, then we are not
guaranteed that the integral gives an upper bound on the error.

(c) Analogously, show that we can use instead $P(error|x) = P(\omega_1|x)P(\omega_2|x)$ and
get a lower bound on the full error.

(d) Show that if we use $P(error|x) = \beta P(\omega_1|x)P(\omega_2|x)$ for $\beta > 1$, then we are not
guaranteed that the integral gives a lower bound on the error.
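
A quick numerical check can make the two bounds of Problem 1 concrete before attempting the proofs. The sketch below integrates the three integrands with the trapezoid rule for two univariate Gaussians; the densities, the integration range and the function name `error_integrals` are assumptions chosen for illustration, not part of the problem.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def error_integrals(mu1, mu2, sigma, prior1, lo=-12.0, hi=12.0, n=20000):
    """Trapezoid estimates of (lower bound, Bayes error, upper bound)."""
    prior2 = 1.0 - prior1
    h = (hi - lo) / n
    lower = bayes = upper = 0.0
    for k in range(n + 1):
        x = lo + k * h
        j1 = prior1 * gauss_pdf(x, mu1, sigma)   # p(x|w1) P(w1)
        j2 = prior2 * gauss_pdf(x, mu2, sigma)   # p(x|w2) P(w2)
        px = j1 + j2
        if px == 0.0:
            continue
        w = h if 0 < k < n else 0.5 * h          # trapezoid end weights
        p1, p2 = j1 / px, j2 / px                # posteriors
        bayes += w * min(p1, p2) * px            # min[P(w1|x), P(w2|x)] p(x)
        lower += w * p1 * p2 * px                # P(w1|x) P(w2|x) p(x)
        upper += w * 2.0 * p1 * p2 * px          # 2 P(w1|x) P(w2|x) p(x)
    return lower, bayes, upper


lower, bayes, upper = error_integrals(-1.0, 1.0, 1.0, 0.5)
print(lower, bayes, upper)
```

The ordering holds pointwise because the smaller posterior m satisfies m(1 - m) <= m <= 2m(1 - m) whenever m <= 1/2, so it survives integration for any densities, not just the Gaussians assumed here.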


Section 2.2


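2. Consider the minimax criterion for the zero-one loss function, i.e., $\lambda_{11} = \lambda_{22} = 0$ and
$\lambda_{12} = \lambda_{21} = 1$.

(a) Prove that in this case the decision regions will satisfy

$$\int_{\mathcal{R}_2} p(\mathbf{x}|\omega_1)\, d\mathbf{x} = \int_{\mathcal{R}_1} p(\mathbf{x}|\omega_2)\, d\mathbf{x}.$$

(b) Is this solution always unique? If not, construct a simple counterexample.
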
3. Consider the minimax criterion for a two-category classification problem. 


(a) Fill in the steps of the derivation of Eq. 22. 


(b) Explain why the overall Bayes risk must be concave down as a function of the 
prior P(w,), as shown in Fig. 2.4. 


(c) Assume we have one-dimensional Gaussian distributions $p(x|\omega_i) \sim N(\mu_i, \sigma_i^2)$,
i = 1, 2, but completely unknown prior probabilities. Use the minimax criterion
to find the optimal decision point $x^*$ in terms of $\mu_i$ and $\sigma_i$ under a zero-one risk.

(d) For the decision point $x^*$ you found in (??), what is the overall minimax risk?
Express this risk in terms of an error function erf(·).

(e) Assume $p(x|\omega_1) \sim N(0, 1)$ and $p(x|\omega_2) \sim N(1/2, 1/4)$, under a zero-one loss.
Find $x^*$ and the overall minimax loss.

(f) Assume $p(x|\omega_1) \sim N(5, 1)$ and $p(x|\omega_2) \sim N(6, 1)$. Without performing any
explicit calculations, determine $x^*$ for the minimax criterion. Explain your
reasoning.

4. Generalize the minimax decision rule in order to classify patterns from three
categories having triangle densities as follows:

$$p(x|\omega_i) \equiv T(\mu_i, \delta_i) = \begin{cases} (\delta_i - |x - \mu_i|)/\delta_i^2 & \text{for } |x - \mu_i| < \delta_i \\ 0 & \text{otherwise,} \end{cases}$$

where $\delta_i > 0$ is the half-width of the distribution (i = 1, 2, 3). Assume for convenience
that $\mu_1 < \mu_2 < \mu_3$, and make some minor simplifying assumptions about the $\delta_i$'s as
needed, to answer the following:

(a) In terms of the priors $P(\omega_i)$, means and half-widths, find the optimal decision
points $x_1^*$ and $x_2^*$ under a zero-one (categorization) loss.

(b) Generalize the minimax decision rule to two decision points, $x_1^*$ and $x_2^*$, for such
triangular distributions.

(c) Let $\{\mu_i, \delta_i\}$ = {0, 1}, {.5, .5}, and {1, 1}. Find the minimax decision rule (i.e.,
$x_1^*$ and $x_2^*$) for this case.


(d) What is the minimax risk? 


5. Consider the Neyman-Pearson criterion for two univariate normal distributions:
$p(x|\omega_i) \sim N(\mu_i, \sigma_i^2)$ and $P(\omega_i) = 1/2$ for i = 1, 2. Assume a zero-one error loss, and
for convenience $\mu_2 > \mu_1$.

(a) Suppose the maximum acceptable error rate for classifying a pattern that is
actually in $\omega_1$ as if it were in $\omega_2$ is $E_1$. Determine the decision boundary in
terms of the variables given.

(b) For this boundary, what is the error rate for classifying $\omega_2$ as $\omega_1$?

(c) What is the overall error rate under zero-one loss?

(d) Apply your results to the specific case $p(x|\omega_1) \sim N(-1, 1)$ and $p(x|\omega_2) \sim N(1, 1)$
and $E_1 = 0.05$.

(e) Compare your result to the Bayes error rate (i.e., without the Neyman-Pearson
conditions).
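
For readers who want to check their analytic answer to Problem 5 numerically, the following sketch computes a Neyman-Pearson threshold with the standard library. It assumes a single-threshold rule (decide $\omega_2$ when $x > x^*$), which is appropriate when $\mu_2 > \mu_1$ and the variances are equal; the function name `neyman_pearson_threshold` is hypothetical.

```python
from statistics import NormalDist


def neyman_pearson_threshold(mu1, sigma1, mu2, sigma2, e1):
    """Return (x_star, e2) for the rule 'decide w2 when x > x_star'.

    x_star is chosen so that P(x > x_star | w1) = e1, and
    e2 = P(x < x_star | w2) is the resulting error on class w2.
    sigma1 and sigma2 are standard deviations, not variances.
    """
    x_star = NormalDist(mu1, sigma1).inv_cdf(1.0 - e1)
    e2 = NormalDist(mu2, sigma2).cdf(x_star)
    return x_star, e2


# The specific case of part (d): N(-1, 1), N(1, 1), E1 = 0.05, equal priors.
x_star, e2 = neyman_pearson_threshold(-1.0, 1.0, 1.0, 1.0, 0.05)
overall = 0.5 * 0.05 + 0.5 * e2

# Bayes error for comparison: threshold midway between the means.
bayes = NormalDist().cdf(-1.0)
print(x_star, e2, overall, bayes)
```

Constraining the error on $\omega_1$ moves the threshold away from the midpoint, so the overall error computed here exceeds the Bayes error, as part (e) asks the reader to observe.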


6. Consider Neyman-Pearson criteria for two Cauchy distributions in one dimension

$$p(x|\omega_i) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}, \qquad i = 1, 2.$$

Assume a zero-one error loss, and for simplicity $a_2 > a_1$, the same “width” b, and
equal priors.

(a) Suppose the maximum acceptable error rate for classifying a pattern that is
actually in $\omega_1$ as if it were in $\omega_2$ is $E_1$. Determine the decision boundary in
terms of the variables given.

(b) For this boundary, what is the error rate for classifying $\omega_2$ as $\omega_1$?

(c) What is the overall error rate under zero-one loss?

(d) Apply your results to the specific case b = 1 and $a_1 = -1$, $a_2 = 1$ and $E_1 = 0.1$.

(e) Compare your result to the Bayes error rate (i.e., without the Neyman-Pearson
conditions).


Section 2.4


7. Let the conditional densities for a two-category one-dimensional problem be given 
by the Cauchy distribution described in Problem 6. 


(a) By explicit integration, check that the distributions are indeed normalized. 


(b) Assuming $P(\omega_1) = P(\omega_2)$, show that $P(\omega_1|x) = P(\omega_2|x)$ if $x = (a_1 + a_2)/2$, i.e.,
the minimum error decision boundary is a point midway between the peaks of
the two distributions, regardless of b.

(c) Plot $P(\omega_1|x)$ for the case $a_1 = 3$, $a_2 = 5$ and b = 1.

(d) How do $P(\omega_1|x)$ and $P(\omega_2|x)$ behave as $x \to -\infty$? As $x \to +\infty$? Explain.


8. Use the conditional densities given in Problem 6, and assume equal prior proba- 
bilities for the categories. 


(a) Show that the minimum probability of error is given by

$$P(error) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \left| \frac{a_2 - a_1}{2b} \right|.$$

(b) Plot this as a function of $|a_2 - a_1|/b$.


(c) What is the maximum value of P(error) and under which conditions can this 
occur? Explain. 


9. Consider the following decision rule for a two-category one-dimensional problem:
Decide $\omega_1$ if $x > \theta$; otherwise decide $\omega_2$.

(a) Show that the probability of error for this rule is given by

$$P(error) = P(\omega_1) \int_{-\infty}^{\theta} p(x|\omega_1)\, dx + P(\omega_2) \int_{\theta}^{\infty} p(x|\omega_2)\, dx.$$

(b) By differentiating, show that a necessary condition to minimize P(error) is that
$\theta$ satisfy

$$p(\theta|\omega_1) P(\omega_1) = p(\theta|\omega_2) P(\omega_2).$$

(c) Does this equation define $\theta$ uniquely?

(d) Give an example where a value of $\theta$ satisfying the equation actually maximizes
the probability of error.


10. Consider 


(a) True or false: In a two-category one-dimensional problem with continuous
feature x, a monotonic transformation of x leaves the Bayes error rate unchanged.


(b) True or false: In a two-category two-dimensional problem with continuous
feature $\mathbf{x}$, monotonic transformations of both $x_1$ and $x_2$ leave the Bayes error rate
unchanged.


11. Suppose that we replace the deterministic decision function $\alpha(\mathbf{x})$ with a
randomized rule, viz., the probability $P(\alpha_i|\mathbf{x})$ of taking action $\alpha_i$ upon observing $\mathbf{x}$.

(a) Show that the resulting risk is given by

$$R = \int \left[ \sum_{i=1}^{a} R(\alpha_i|\mathbf{x}) P(\alpha_i|\mathbf{x}) \right] p(\mathbf{x})\, d\mathbf{x}.$$

(b) In addition, show that R is minimized by choosing $P(\alpha_i|\mathbf{x}) = 1$ for the action
$\alpha_i$ associated with the minimum conditional risk $R(\alpha_i|\mathbf{x})$, thereby showing that
no benefit can be gained from randomizing the best decision rule.


(c) Can we benefit from randomizing a suboptimal rule? Explain. 


12. Let $\omega_{max}(\mathbf{x})$ be the state of nature for which $P(\omega_{max}|\mathbf{x}) \ge P(\omega_i|\mathbf{x})$ for all i,
i = 1, ..., c.

(a) Show that $P(\omega_{max}|\mathbf{x}) \ge 1/c$.

(b) Show that for the minimum-error-rate decision rule the average probability of
error is given by

$$P(error) = 1 - \int P(\omega_{max}|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$$

(c) Use these two results to show that $P(error) \le (c - 1)/c$.

(d) Describe a situation for which $P(error) = (c - 1)/c$.


13. In many pattern classification problems one has the option either to assign the 
pattern to one of c classes, or to reject it as being unrecognizable. If the cost for 
rejects is not too high, rejection may be a desirable action. Let 


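$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j, \quad i, j = 1, \ldots, c \\ \lambda_r & i = c + 1 \\ \lambda_s & \text{otherwise,} \end{cases}$$

where $\lambda_r$ is the loss incurred for choosing the (c + 1)th action, rejection, and $\lambda_s$ is the
loss incurred for making a substitution error. Show that the minimum risk is obtained
if we decide $\omega_i$ if $P(\omega_i|\mathbf{x}) \ge P(\omega_j|\mathbf{x})$ for all j and if $P(\omega_i|\mathbf{x}) \ge 1 - \lambda_r/\lambda_s$, and reject
otherwise. What happens if $\lambda_r = 0$? What happens if $\lambda_r > \lambda_s$?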
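
The rule stated in Problem 13 is short enough to sketch directly. The code below is a minimal illustration, not a proof; it also evaluates the conditional risks implied by the loss function above, $R(\alpha_i|\mathbf{x}) = \lambda_s (1 - P(\omega_i|\mathbf{x}))$ for classification and $\lambda_r$ for rejection, so the two views can be compared. The function names are hypothetical.

```python
def decide_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or None to reject."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    if posteriors[best] >= 1.0 - lambda_r / lambda_s:
        return best
    return None


def conditional_risks(posteriors, lambda_r, lambda_s):
    """Risk of each classification action, followed by the risk of rejecting."""
    risks = [lambda_s * (1.0 - p) for p in posteriors]
    risks.append(lambda_r)
    return risks


# With lambda_r / lambda_s = 1/4 the acceptance threshold is 0.75.
confident = decide_with_reject([0.8, 0.2], 0.25, 1.0)   # class 0 is accepted
doubtful = decide_with_reject([0.6, 0.4], 0.25, 1.0)    # pattern is rejected
never_reject = decide_with_reject([0.5, 0.5], 2.0, 1.0) # lambda_r > lambda_s
print(confident, doubtful, never_reject)
print(conditional_risks([0.6, 0.4], 0.25, 1.0))
```

In the doubtful case the rejection risk (0.25) is smaller than the best classification risk (0.4), which is why the threshold test and the minimum-risk choice agree.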

14. Consider the classification problem with rejection option. 


(a) Use the results of Problem 13 to show that the following discriminant functions
are optimal for such problems:

$$g_i(\mathbf{x}) = \begin{cases} p(\mathbf{x}|\omega_i) P(\omega_i) & i = 1, \ldots, c \\ \dfrac{\lambda_s - \lambda_r}{\lambda_s} \sum_{j=1}^{c} p(\mathbf{x}|\omega_j) P(\omega_j) & i = c + 1. \end{cases}$$

(b) Plot these discriminant functions and the decision regions for the two-category
one-dimensional case having

• $p(x|\omega_1) \sim N(1, 1)$,
• $p(x|\omega_2) \sim N(-1, 1)$,
• $P(\omega_1) = P(\omega_2) = 1/2$, and
• $\lambda_r/\lambda_s = 1/4$.

(c) Describe qualitatively what happens as $\lambda_r/\lambda_s$ is increased from 0 to 1.

(d) Repeat for the case having

• $p(x|\omega_1) \sim N(1, 1)$,
• $p(x|\omega_2) \sim N(0, 1/4)$,
• $P(\omega_1) = 1/3$, $P(\omega_2) = 2/3$, and
• $\lambda_r/\lambda_s = 1/2$.


Section 2.5


15. Confirm Eq. 45 for the volume of a d-dimensional hypersphere as follows:

(a) Verify that the equation is correct for a line (d = 1).

(b) Verify that the equation is correct for a disk (d = 2).

(c) Integrate the volume of a line over appropriate limits to obtain the volume of a
disk.

(d) Consider a general d-dimensional hypersphere. Integrate its volume to obtain
a formula (involving the ratio of gamma functions, Γ(·)) for the volume of a
(d + 1)-dimensional hypersphere.

(e) Apply your formula to find the volume of a hypersphere in an odd-dimensional
space by integrating the volume of a hypersphere in the lower even-dimensional
space, and thereby confirm Eq. 45 for odd dimensions.

(f) Repeat the above but for finding the volume of a hypersphere in even dimensions.
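
As a sanity check on the derivation requested in Problem 15, one can compare the gamma-function form of the hypersphere volume with the separate even and odd factorial forms. The sketch assumes that Eq. 45 (not reproduced in this section) is the standard closed form for the volume of a d-dimensional sphere of radius r; the function names are hypothetical.

```python
import math


def volume_gamma(d, r=1.0):
    """Volume via the gamma function: pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2.0) * r ** d / math.gamma(d / 2.0 + 1.0)


def volume_factorial(d, r=1.0):
    """The same volume written separately for even and odd d."""
    if d % 2 == 0:
        return math.pi ** (d // 2) * r ** d / math.factorial(d // 2)
    k = (d - 1) // 2
    return (2 ** d * math.pi ** k * math.factorial(k) * r ** d
            / math.factorial(d))


for d in range(1, 8):
    print(d, volume_gamma(d), volume_factorial(d))
```

For d = 1, 2, 3 both forms give the familiar 2r, πr² and 4πr³/3, which are the cases parts (a) and (b) ask the reader to verify by hand.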


16. Derive the formula for the volume of a d-dimensional hypersphere in Eq. 45 as
follows:

(a) State by inspection the formula for $V_1$.

(b) Follow the general procedure outlined in Problem 15 and integrate twice to find
$V_{d+2}$ as a function of $V_d$.

(c) Assume that the functional form of $V_d$ is the same for all odd dimensions (and
likewise for all even dimensions). Use your integration results to determine the
formula for $V_d$ for d odd.

(d) Use your intermediate integration results to determine $V_d$ for d even.

(e) Explain why we should expect the functional form of $V_d$ to be different in even
and in odd dimensions.

17. Derive the formula (Eq. 44) for the volume V of a hyperellipsoid of constant
Mahalanobis distance r (Eq. 43) for a Gaussian distribution having covariance $\Sigma$.

18. Consider two normal distributions in one dimension: $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$.
Imagine that we choose two random samples $x_1$ and $x_2$, one from each of the normal
distributions, and calculate their sum $x_3 = x_1 + x_2$. Suppose we do this repeatedly.

(a) Consider the resulting distribution of the values of $x_3$. Show from first principles
that this is also a normal distribution.

(b) What is the mean, $\mu_3$, of your new distribution?

(c) What is the variance, $\sigma_3^2$?

(d) Repeat the above with two distributions in a multi-dimensional space, i.e.,
$N(\boldsymbol{\mu}_1, \Sigma_1)$ and $N(\boldsymbol{\mu}_2, \Sigma_2)$.


19. Starting from the definition of entropy (Eq. 36), derive the general equation for
the maximum-entropy distribution given constraints expressed in the general form

$$\int b_k(x)\, p(x)\, dx = a_k, \qquad k = 1, 2, \ldots, q$$

as follows:

(a) Use Lagrange undetermined multipliers $\lambda_1, \lambda_2, \ldots, \lambda_q$ and derive the synthetic
function:

$$H_s = -\int p(x) \left[ \ln p(x) - \sum_{k=0}^{q} \lambda_k b_k(x) \right] dx - \sum_{k=0}^{q} \lambda_k a_k.$$

State why we know $a_0 = 1$ and $b_0(x) = 1$ for all x.

(b) Take the derivative of $H_s$ with respect to p(x). Equate the integrand to zero,
and thereby prove that the maximum-entropy distribution obeys

$$p(x) = \exp\left[ \sum_{k=0}^{q} \lambda_k b_k(x) - 1 \right],$$

where the q + 1 parameters are determined by the constraint equations.

20. Use the final result from Problem 19 for the following.

(a) Suppose we know only that a distribution is non-zero in the range $x_l \le x \le x_u$.
Prove that the maximum entropy distribution is uniform in that range, i.e.,

$$p(x) \sim U(x_l, x_u) = \begin{cases} 1/|x_u - x_l| & x_l \le x \le x_u \\ 0 & \text{otherwise.} \end{cases}$$

(b) Suppose we know only that a distribution is non-zero for $x \ge 0$ and that its
mean is $\mu$. Prove that the maximum entropy distribution is

$$p(x) = \begin{cases} \frac{1}{\mu} e^{-x/\mu} & \text{for } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

(c) Now suppose we know solely that the distribution is normalized, has mean $\mu$,
and variance $\sigma^2$, and thus from Problem 19 our maximum entropy
distribution must be of the form

$$p(x) = \exp[\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2].$$

Write out the three constraints and solve for $\lambda_0$, $\lambda_1$, and $\lambda_2$ and thereby prove
that the maximum entropy solution is a Gaussian, i.e.,

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right].$$


21. Three distributions (a Gaussian, a uniform distribution, and a triangle
distribution, cf. Problem 4) each have mean zero and variance $\sigma^2$. Use
Eq. 36 to calculate and compare their entropies.

22. Calculate the entropy of a multidimensional Gaussian $p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \Sigma)$.
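
A numerical comparison can confirm an analytic answer to Problem 21. The sketch below estimates the differential entropy $-\int p(x) \ln p(x)\, dx$ of the three densities by the midpoint rule, with each density scaled to variance $\sigma^2$. It assumes natural logarithms (nats); if Eq. 36 uses base-2 logarithms, divide the results by ln 2. The helper names are hypothetical.

```python
import math


def entropy_numeric(pdf, lo, hi, n=100000):
    """Midpoint-rule estimate of -integral p ln p dx, in nats."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        p = pdf(lo + (k + 0.5) * h)
        if p > 0.0:
            total -= p * math.log(p) * h
    return total


sigma = 1.0
a = math.sqrt(3.0) * sigma       # uniform on [-a, a] has variance a^2 / 3
delta = math.sqrt(6.0) * sigma   # triangle of half-width delta: delta^2 / 6


def gaussian(x):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def uniform(x):
    return 1.0 / (2.0 * a) if abs(x) <= a else 0.0


def triangle(x):
    return max(0.0, delta - abs(x)) / delta ** 2


h_gauss = entropy_numeric(gaussian, -10.0, 10.0)
h_tri = entropy_numeric(triangle, -10.0, 10.0)
h_unif = entropy_numeric(uniform, -10.0, 10.0)
print(h_gauss, h_tri, h_unif)
```

The estimates agree with the closed forms $\frac{1}{2}\ln(2\pi e \sigma^2)$, $\frac{1}{2} + \ln\delta$ and $\ln(2a)$, and the Gaussian has the largest entropy of the three, consistent with Problem 20(c).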


Section 2.6


23. Consider the three-dimensional normal distribution $p(\mathbf{x}|\omega) \sim N(\boldsymbol{\mu}, \Sigma)$ where

$$\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 5 & 2 \\ 0 & 2 & 5 \end{pmatrix}.$$

(a) Find the probability density at the point $\mathbf{x}_0 = (.5, 0, 1)^t$.

(b) Construct the whitening transformation $A_w$. Show your $\Lambda$ and $\Phi$ matrices.
Next, convert the distribution to one centered on the origin with covariance
matrix equal to the identity matrix, $p(\mathbf{x}|\omega) \sim N(\mathbf{0}, I)$.

(c) Apply the same overall transformation to $\mathbf{x}_0$ to yield a transformed point $\mathbf{x}_w$.

(d) By explicit calculation, confirm that the Mahalanobis distance from $\mathbf{x}_0$ to the
mean $\boldsymbol{\mu}$ in the original distribution is the same as for $\mathbf{x}_w$ to $\mathbf{0}$ in the transformed
distribution.

(e) Does the probability density remain unchanged under a general linear
transformation? In other words, is $p(\mathbf{x}_0|N(\boldsymbol{\mu}, \Sigma)) = p(T^t\mathbf{x}_0|N(T^t\boldsymbol{\mu}, T^t\Sigma T))$ for some
linear transform T? Explain.

(f) Prove that a general whitening transform $A_w = \Phi\Lambda^{-1/2}$ when applied to a
Gaussian distribution ensures that the final distribution has covariance
proportional to the identity matrix I. Check whether normalization is preserved by the
transformation.
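
24. Consider the multivariate normal density for which $\sigma_{ij} = 0$ and $\sigma_{ii} = \sigma_i^2$, i.e.,
$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$.

(a) Show that the evidence is

$$p(\mathbf{x}) = \frac{1}{\prod_{i=1}^{d} \sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{1}{2} \sum_{i=1}^{d} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right].$$

(b) Plot and describe the contours of constant density.

(c) Write an expression for the Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.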


25. Fill in the steps in the derivation from Eq. 57 to Eqs. 58-63. 

26. Let $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \Sigma)$ for a two-category d-dimensional problem with the
same covariances but arbitrary means and prior probabilities. Consider the squared
Mahalanobis distance

$$r_i^2 = (\mathbf{x} - \boldsymbol{\mu}_i)^t \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i).$$

(a) Show that the gradient of $r_i^2$ is given by

$$\nabla r_i^2 = 2\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}_i).$$

(b) Show that at any position on a given line through $\boldsymbol{\mu}_i$ the gradient $\nabla r_i^2$ points
in the same direction. Must this direction be parallel to that line?

(c) Show that $\nabla r_1^2$ and $\nabla r_2^2$ point in opposite directions along the line from $\boldsymbol{\mu}_1$ to
$\boldsymbol{\mu}_2$.

(d) Show that the optimal separating hyperplane is tangent to the constant
probability density hyperellipsoids at the point that the separating hyperplane cuts
the line from $\boldsymbol{\mu}_1$ to $\boldsymbol{\mu}_2$.

(e) True or False: For a two-category problem involving normal densities with
arbitrary means and covariances, and $P(\omega_1) = P(\omega_2) = 1/2$, the Bayes decision
boundary consists of the set of points of equal Mahalanobis distance from the
respective sample means. Explain.


27. Suppose we have two normal distributions with the same covariances but different
means: $N(\boldsymbol{\mu}_1, \Sigma)$ and $N(\boldsymbol{\mu}_2, \Sigma)$. In terms of their prior probabilities $P(\omega_1)$ and
$P(\omega_2)$, state the condition that the Bayes decision boundary not pass between the
two means.

28. Two random variables x and y are called “statistically independent” if
$p(x, y|\omega) = p(x|\omega)\, p(y|\omega)$.

(a) Prove that if $x_i - \mu_i$ and $x_j - \mu_j$ are statistically independent (for $i \ne j$) then
$\sigma_{ij}$ as defined in Eq. 42 is 0.

(b) Prove that the converse is true for the Gaussian case.

(c) Show by counterexample that this converse is not true in the general case.

29. Consider the Bayes decision boundary for two-category classification in d
dimensions.

(a) Prove that for any arbitrary hyperquadratic in d dimensions, there exist normal
distributions $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \Sigma_i)$ and priors $P(\omega_i)$, i = 1, 2, that possess this
hyperquadratic as their Bayes decision boundary.

(b) Is the above also true if the priors are held fixed and non-zero, e.g., $P(\omega_1) =
P(\omega_2) = 1/2$?


Section 2.7


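30. Let $p(x|\omega_i) \sim N(\mu_i, \sigma^2)$ for a two-category one-dimensional problem with
$P(\omega_1) = P(\omega_2) = 1/2$.

(a) Show that the minimum probability of error is given by

$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-u^2/2}\, du,$$

where $a = |\mu_2 - \mu_1|/(2\sigma)$.

(b) Use the inequality

$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-t^2/2}\, dt \le \frac{1}{\sqrt{2\pi}\, a} e^{-a^2/2}$$

to show that $P_e$ goes to zero as $|\mu_2 - \mu_1|/\sigma$ goes to infinity.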
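
The tail integral and its upper bound in Problem 30 are easy to tabulate with the complementary error function, since $\frac{1}{\sqrt{2\pi}} \int_a^\infty e^{-u^2/2}\, du = \frac{1}{2}\,\mathrm{erfc}(a/\sqrt{2})$. The sketch below is only a numerical illustration of the inequality, not the requested derivation; the function names are hypothetical.

```python
import math


def gaussian_tail(a):
    """P_e = (1/sqrt(2 pi)) * integral from a to infinity of exp(-u^2/2) du."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def tail_upper_bound(a):
    """The bound exp(-a^2/2) / (sqrt(2 pi) a), valid for a > 0."""
    return math.exp(-0.5 * a * a) / (math.sqrt(2.0 * math.pi) * a)


for a in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(a, gaussian_tail(a), tail_upper_bound(a))
```

Both columns shrink rapidly as a grows, and the bound becomes relatively tighter for large a, which is the regime part (b) is concerned with.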


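31. Let $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \sigma^2 I)$ for a two-category d-dimensional problem with $P(\omega_1) =
P(\omega_2) = 1/2$.

(a) Show that the minimum probability of error is given by

$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-u^2/2}\, du,$$

where $a = \|\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1\|/(2\sigma)$.

(b) Let $\boldsymbol{\mu}_1 = \mathbf{0}$ and $\boldsymbol{\mu}_2 = (\mu_1, \ldots, \mu_d)^t$. Use the inequality from Problem 30 to show
that $P_e$ approaches zero as the dimension d approaches infinity.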


(c) Express the meaning of this result in words. 


32. Show that if the densities in a two-category classification problem differ markedly
from Gaussian, the Chernoff and Bhattacharyya bounds are not likely to be
informative by considering the following one-dimensional examples. Consider a number of
problems in which the mean and variance are the same (and thus the Chernoff bound
and the Bhattacharyya bound remain the same), but nevertheless have a wide range
in Bayes error. For definiteness, assume the distributions have means at $\mu_1 = -\mu$ and
$\mu_2 = +\mu$, and $\sigma_1^2 = \sigma_2^2 = \mu^2$.

(a) Use the equations in the text to calculate the Chernoff and the Bhattacharyya
bounds on the error.

(b) Suppose the distributions are both Gaussian. Calculate explicitly the Bayes
error. Express it in terms of an error function erf(·) and as a numerical value.

(c) Now consider another case, in which half the density for $\omega_1$ is concentrated
at a point $x = -2\mu$ and half at x = 0; likewise (symmetrically) the density for
$\omega_2$ has half its mass at $x = +2\mu$ and half at x = 0. Show that the means and
variance remain as desired, but that now the Bayes error is 0.5.

(d) Now consider yet another case, in which half the density for $\omega_1$ is concentrated
near $x = -2\mu$ and half at $x = -\epsilon$, where $\epsilon$ is an infinitesimally small positive
distance; likewise (symmetrically) the density for $\omega_2$ has half its mass near
$x = +2\mu$ and half at $+\epsilon$. Show that by making $\epsilon$ sufficiently small, the means
and variances can be made arbitrarily close to $\pm\mu$ and $\mu^2$, respectively. Show,
too, that now the Bayes error is zero.

(e) Compare your errors in (b), (c) and (d) to your Chernoff and Bhattacharyya
bounds of (a) and explain in words why those bounds are unlikely to be of much
use if the distributions differ markedly from Gaussians.


33. Suppose we know exactly two arbitrary distributions $p(\mathbf{x}|\omega_i)$ and priors $P(\omega_i)$
in a d-dimensional feature space.


(a) Prove that the true error cannot decrease if we first project the distributions to 
a lower dimensional space and then classify them. 


(b) Despite this fact, suggest why in an actual pattern recognition application we 
might not want to include an arbitrarily high number of feature dimensions. 


Section 2.8


34. Show for non-pathological cases that if we include more feature dimensions
in a Bayesian classifier for multidimensional Gaussian distributions then the
Bhattacharyya bound decreases. Do this as follows: Let $P_d(P(\omega_1), \boldsymbol{\mu}_1, \Sigma_1, P(\omega_2), \boldsymbol{\mu}_2, \Sigma_2)$,
or simply $P_d$, be the Bhattacharyya bound if we consider the distributions restricted
to d dimensions.

(a) Using general properties of a covariance matrix, prove that k(1/2) of Eq. 75
must increase as we increase from d to d + 1 dimensions, and hence the error
bound must decrease.

(b) Explain why this general result does or does not depend upon which dimension
is added.

(c) What is a “pathological” case in which the error bound does not decrease, i.e.,
for which $P_{d+1} = P_d$?

(d) Is it ever possible that the true error could increase as we go to higher dimension?

(e) Prove that as $d \to \infty$, $P_d \to 0$ for non-pathological distributions. Describe
pathological distributions for which this infinite limit does not hold.

(f) Given that the Bhattacharyya bound decreases for the inclusion of a particular
dimension, does this guarantee that the true error will decrease? Explain.

35. Derive Eqs. 72 & 73 from Eq. 71 by the following steps: 


(a) Substitute the normal distributions into the integral and gather the terms de- 
pendent upon x and those that are not dependent upon x. 


(b) Factor the term independent of x from the integral. 
(c) Integrate explicitly the term dependent upon x. 


36. Consider a two-category classification problem in two dimensions with $p(\mathbf{x}|\omega_1) \sim
N(\mathbf{0}, I)$, $p(\mathbf{x}|\omega_2) \sim N\left(\binom{1}{1}, I\right)$ and $P(\omega_1) = P(\omega_2) = 1/2$.

(a) Calculate the Bayes decision boundary. 

(b) Calculate the Bhattacharyya error bound. 


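(c) Repeat the above for the same prior probabilities, but $p(\mathbf{x}|\omega_1) \sim N\left(\mathbf{0}, \begin{pmatrix} 2 & .5 \\ .5 & 2 \end{pmatrix}\right)$
and $p(\mathbf{x}|\omega_2) \sim N\left(\binom{1}{1}, \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}\right)$.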


37. Derive the Bhattacharyya error bound without the need for first examining the 
Chernoff bound. Do this as follows: 


(a) If a and b are nonnegative numbers, show directly that $\min[a, b] \le \sqrt{ab}$.

(b) Use this to show that the error rate for a two-category Bayes classifier must
satisfy

$$P(error) \le \sqrt{P(\omega_1) P(\omega_2)}\; \rho \le \rho/2,$$

where $\rho$ is the so-called Bhattacharyya coefficient

$$\rho = \int \sqrt{p(\mathbf{x}|\omega_1)\, p(\mathbf{x}|\omega_2)}\; d\mathbf{x}.$$
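
The chain of inequalities in Problem 37 can be checked numerically for a simple case. The sketch below integrates the Bhattacharyya coefficient by the midpoint rule for two unit-variance Gaussians whose means differ by 1, with equal priors; these densities, the integration range and the function name are assumptions chosen for illustration. For equal variances the coefficient has the closed form $\exp[-(\mu_2 - \mu_1)^2/(8\sigma^2)]$, which serves as a cross-check.

```python
import math
from statistics import NormalDist


def bhattacharyya_coefficient(d1, d2, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule estimate of the integral of sqrt(p1(x) p2(x)) dx."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += math.sqrt(d1.pdf(x) * d2.pdf(x))
    return total * h


w1 = NormalDist(0.0, 1.0)
w2 = NormalDist(1.0, 1.0)
prior1 = prior2 = 0.5

rho = bhattacharyya_coefficient(w1, w2)
bound = math.sqrt(prior1 * prior2) * rho

# Bayes error for equal priors and equal variances: threshold at the midpoint.
bayes_error = NormalDist().cdf(-0.5)
print(bayes_error, bound, rho / 2.0)
```

Here the Bayes error is about 0.31 while the bound is about 0.44, so the bound holds but is not tight for such heavily overlapping densities.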


38. Use the signal detection theory, the notation and basic Gaussian assumptions 
described in the text to address the following. 


(a) Prove that $P(x > x^*|x \in \omega_2)$ and $P(x < x^*|x \in \omega_2)$, taken together, uniquely
determine the discriminability $d'$.

(b) Use error functions erf(·) to express $d'$ in terms of the hit and false alarm rates.
Estimate $d'$ if $P(x > x^*|x \in \omega_2) = 0.8$ and $P(x < x^*|x \in \omega_2) = 0.3$. Repeat for
$P(x > x^*|x \in \omega_2) = 0.7$ and $P(x < x^*|x \in \omega_2) = 0.4$.

(c) Given that the Gaussian assumption is valid, calculate the Bayes error for both
the cases in (b).

(d) Determine by a trivial one-line computation, which case has the higher $d'$:

case A: $P(x > x^*|x \in \omega_2) = 0.8$, $P(x < x^*|x \in \omega_2) = 0.3$ or
case B: $P(x > x^*|x \in \omega_2) = 0.9$, $P(x < x^*|x \in \omega_2) = 0.7$.

Explain your logic.
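
Under the equal-variance Gaussian assumptions of signal detection theory, the discriminability follows from one hit rate and one false alarm rate through the inverse normal cumulative distribution. The sketch below assumes the hit rate is $P(x > x^*|x \in \omega_2)$ and the false alarm rate is $P(x > x^*|x \in \omega_1)$, and it reads the two numbers given in Problem 38(b) as such a pair; that reading, and the function name `d_prime`, are assumptions.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit) - z(false alarm), with z the inverse standard normal CDF.

    Follows from hit = 1 - Phi((x* - mu2)/sigma) and
    false alarm = 1 - Phi((x* - mu1)/sigma), so the difference of the
    z-scores is (mu2 - mu1)/sigma regardless of the threshold x*.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


first = d_prime(0.8, 0.3)
second = d_prime(0.7, 0.4)
print(first, second)
```

Because the threshold cancels in the difference, any single operating point on the ROC curve determines $d'$ in this symmetric case, which is the content of part (a).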


39. Suppose in our signal detection framework we had two Gaussians, but with
different variances (cf., Fig. 2.20), that is, $p(x|\omega_1) \sim N(\mu_1, \sigma_1^2)$ and $p(x|\omega_2) \sim N(\mu_2, \sigma_2^2)$
for $\mu_2 > \mu_1$ and $\sigma_2^2 \ne \sigma_1^2$. In that case the resulting ROC curve would no longer be
symmetric.

(a) Suppose in this asymmetric case we modified the definition of the
discriminability to be $d'_a = |\mu_2 - \mu_1|/\sqrt{\sigma_1 \sigma_2}$. Show by non-trivial counterexample or analysis
that one cannot determine $d'_a$ uniquely based on a single pair of hit and false
alarm rates.

(b) Assume we measure the hit and false alarm rates for two different, but unknown,
values of the threshold $x^*$. Derive a formula for $d'_a$ based on measurements.

(c) State and explain all pathological values for which your formula does not give
a meaningful value for $d'_a$.

(d) Plot several ROC curves for the case $p(x|\omega_1) \sim N(0, 1)$ and $p(x|\omega_2) \sim N(1, 2)$.


40. Consider two one-dimensional triangle distributions having different means, but
the same width:

$$p(x|\omega_i) \equiv T(\mu_i, \delta) = \begin{cases} (\delta - |x - \mu_i|)/\delta^2 & \text{for } |x - \mu_i| < \delta \\ 0 & \text{otherwise,} \end{cases}$$

with $\mu_2 > \mu_1$. We define a new discriminability here as $d'_T = (\mu_2 - \mu_1)/\delta$.

(a) Write an analytic function, parameterized by $d'_T$, for the operating characteristic
curves.

(b) Plot these novel operating characteristic curves for $d'_T$ = {.1, .2, ..., 1.0}.
Interpret your answer for the case $d'_T = 1.0$.

(c) Suppose we measure $P(x > x^*|x \in \omega_2) = .4$ and $P(x > x^*|x \in \omega_1) = .2$. What
is $d'_T$? What is the Bayes error rate?

(d) Infer the decision rule. That is, express $x^*$ in terms of the variables given in the
problem.

(e) Suppose we measure $P(x > x^*|x \in \omega_2) = .9$ and $P(x > x^*|x \in \omega_1) = .3$. What is
$d'_T$? What is the Bayes error rate?

(f) Infer the decision rule. That is, express $x^*$ in terms of the variables given in the
problem.


41. Equation 70 can be used to obtain an upper bound on the error. One can
also derive tighter analytic bounds in the two-category case (both upper and lower
bounds), analogous to Eq. 71 for general distributions. If we let p = p(x|ω_1), then
we seek tighter bounds on Min[p, 1 - p] (which has discontinuous derivative).


(a) Prove that

$$ b_L(p) = \frac{1}{\beta} \ln\left[\frac{1 + e^{-\beta}}{e^{-\beta p} + e^{-\beta(1-p)}}\right] $$

for any β > 0 is a lower bound on Min[p, 1 - p].
(b) Prove that one can choose β in (a) to give an arbitrarily tight lower bound.
(c) Repeat (a) and (b) for the upper bound given by

$$ b_U(p) = b_L(p) + [1 - 2 b_L(0.5)]\, b_G(p) $$

where b_G(p) is any upper bound that obeys

$$ b_G(p) \ge \mathrm{Min}[p, 1-p], \qquad b_G(p) = b_G(1-p), \qquad b_G(0) = b_G(1) = 0, \qquad b_G(0.5) = 0.5. $$

(d) Confirm that b_G(p) = 1/2 sin[πp] obeys the conditions in (c).
(e) Let b_G(p) = 1/2 sin[πp], and plot your upper and lower bounds as a function of
p, for 0 ≤ p ≤ 1 and β = 1, 10, 50.


Section 2.9


42. Let the components of the vector x = (x_1, ..., x_d)^t be binary valued (0 or 1) and
P(ω_j) be the prior probability for the state of nature ω_j and j = 1, ..., c. Now define

$$ p_{ij} = \mathrm{Prob}(x_i = 1 \mid \omega_j) $$

with the components x_i being statistically independent for all x in ω_j.
(a) Interpret in words the meaning of p_ij.


(b) Show that the minimum probability of error is achieved by the following decision
rule: Decide ω_k if g_k(x) ≥ g_j(x) for all j and k, where

$$ g_j(\mathbf{x}) = \sum_{i=1}^{d} x_i \ln\frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{d} \ln(1 - p_{ij}) + \ln P(\omega_j). $$
43. Let the components of the vector x = (x_1, ..., x_d)^t be ternary valued (1, 0 or
-1), with

$$ p_{ij} = \mathrm{Prob}(x_i = 1 \mid \omega_j) $$
$$ q_{ij} = \mathrm{Prob}(x_i = 0 \mid \omega_j) $$
$$ r_{ij} = \mathrm{Prob}(x_i = -1 \mid \omega_j), $$

and with the components x_i being statistically independent for all x in ω_j.


(a) Show that a minimum probability of error decision rule can be derived that
involves discriminant functions g_j(x) that are quadratic functions of the components
x_i.


(b) Suggest a generalization to more categories of your answers to this and Problem 42.
44. Let x be distributed as in Problem 42 with c = 2, d odd, and

$$ p_{i1} = p > 1/2, \qquad i = 1, \ldots, d $$
$$ p_{i2} = 1 - p, \qquad i = 1, \ldots, d $$

and P(ω_1) = P(ω_2) = 1/2.
(a) Show that the minimum-error-rate decision rule becomes:

$$ \text{Decide } \omega_1 \text{ if } \sum_{i=1}^{d} x_i > d/2 \text{ and } \omega_2 \text{ otherwise.} $$

(b) Show that the minimum probability of error is given by

$$ P_e(d, p) = \sum_{k=0}^{(d-1)/2} \binom{d}{k} p^k (1 - p)^{d-k}, $$

where \binom{d}{k} = d!/(k!(d - k)!) is the binomial coefficient.
(c) What is the limiting value of P_e(d, p) as p → 1/2? Explain.
(d) Show that P_e(d, p) approaches zero as d → ∞. Explain.


45. Under the natural assumption concerning losses, i.e., that λ_21 > λ_11 and λ_12 >
λ_22, show that the general minimum risk discriminant function for the independent
binary case described in Sect. 2.9.1 is given by g(x) = w^t x + w_0, where w is unchanged,
and

$$ w_0 = \sum_{i=1}^{d} \ln\frac{1 - p_i}{1 - q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)} + \ln\frac{\lambda_{21} - \lambda_{11}}{\lambda_{12} - \lambda_{22}}. $$


46. The Poisson distribution for a discrete variable x = 0, 1, 2, ... and real parameter
λ is

$$ P(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}. $$

(a) Prove that the mean of such a distribution is E[x] = λ.
(b) Prove that the variance of such a distribution is E[(x - x̄)²] = λ.


(c) The mode of a distribution is the value of x that has the maximum probability.
Prove that the mode of a Poisson distribution is the greatest integer that does
not exceed λ, i.e., the mode is ⌊λ⌋. (If λ is an integer, then both λ and λ - 1
are modes.)


(d) Consider two equally probable categories having Poisson distributions but with
differing parameters; assume for definiteness λ_1 > λ_2. What is the Bayes
classification decision?


(e) What is the Bayes error rate? 
Section 2.10


47. Suppose we have three categories in two dimensions with the following underlying
distributions:


• p(x|ω_1) ~ N(0, I)

• p(x|ω_2) ~ N((1, 1)^t, I)

• p(x|ω_3) ~ 1/2 N((.5, .5)^t, I) + 1/2 N((-.5, .5)^t, I)
with P(ω_i) = 1/3, i = 1, 2, 3.


(a) By explicit calculation of posterior probabilities, classify the point x = (.3, .3)^t for
minimum probability of error.


(b) Suppose that for a particular test point the first feature is missing. That is,
classify x = (*, .3)^t.


(c) Suppose that for a particular test point the second feature is missing. That is,
classify x = (.3, *)^t.


(d) Repeat all of the above for x = (.2, .6)^t.


48. Show that Eq. 93 reduces to Bayes rule when the true feature is μ_i and
p(x_b|x_t) ~ N(x_t, Σ). Interpret this answer in words.


Section 2.11


49. Suppose we have three categories with P(w1) = 1/2, P(w2) = P(w3) = 1/4 and 
the following distributions 


• p(x|ω_1) ~ N(0, 1)
• p(x|ω_2) ~ N(.5, 1)
• p(x|ω_3) ~ N(1, 1),
and that we sample the following four points: x = 0.6, 0.1, 0.9, 1.1.


(a) Calculate explicitly the probability that the sequence actually came from ω_1, ω_3, ω_3, ω_2.
Be careful to consider normalization.


(b) Repeat for the sequence ω_1, ω_2, ω_2, ω_3.


(c) Find the sequence having the maximum probability. 
Computer exercises


Several of the computer exercises will rely on the following data.


                  ω_1                       ω_2                       ω_3
sample     x_1     x_2     x_3       x_1     x_2     x_3       x_1     x_2     x_3
   1      5.01    8.12    3.68      0.91    0.18    0.05      5.35    2.26
   2      5.43    3.48    3.54      1.30   -2.06   -3.53      5.12    3.22   -2.66
   3      1.08   -5.52    1.66      7.75    4.54    0.95      1.34    5.31
   4      0.86   -3.78   -4.11     -5.47    0.50    3.92      4.48    3.42
   5     -2.67    0.63    7.39      6.14    5.72   -4.85      7.11    2.39
   6      4.94    3.29    2.08      3.60    1.26    4.36      7.17    4.33   -0.98
   7     -2.51    2.09   -2.59      5.37   -4.63   -3.65      5.75    3.97
   8      2.25    2.13    6.94      7.18    1.46   -6.66      0.77    0.27
   9      5.56    2.86   -2.26     -7.39    1.17    6.30      0.90   -0.43   -8.71
  10      1.03   -3.33    4.33      7.50    6.32    0.31      3.52   -0.36


Section 2.2


1. You may need the following procedures for several exercises below. 


(a) Write a procedure to generate random samples according to a normal distribution
N(μ, Σ) in d dimensions.


(b) Write a procedure to calculate the discriminant function (of the form given in
Eq. 47) for a given normal distribution and prior probability P(ω_i).


(c) Write a procedure to calculate the Euclidean distance between two arbitrary
points.


(d) Write a procedure to calculate the Mahalanobis distance between the mean μ
and an arbitrary point x, given the covariance matrix Σ.


Section 2.5


2. Use your classifier from Problem ?? to classify the following 10 samples from 
the table above in the following way. Assume that the underlying distributions are 
normal. 


(a) Assume that the prior probabilities for the first two categories are equal (P(ω_1) =
P(ω_2) = 1/2 and P(ω_3) = 0) and design a dichotomizer for those two categories
using only the x_1 feature value.


(b) Determine the empirical training error on your samples, i.e., the percentage of
points misclassified.


(c) Use the Bhattacharyya bound to bound the error you will get on novel patterns
drawn from the distributions.


(d) Repeat all of the above, but now use two feature values, x_1 and x_2.
(e) Repeat, but use all three feature values.


(f) Discuss your results. In particular, is it ever possible for a finite set of data that
the empirical error might be larger for more data dimensions?
3. Repeat Computer exercise 2 but for categories ω_1 and ω_3.
4. Repeat Computer exercise 2 but for categories ω_2 and ω_3.
5. Consider the three categories in Computer exercise 2, and assume P(ω_i) = 1/3.


(a) What is the Mahalanobis distance between each of the following test points and
each of the category means in Computer exercise 2: (1, 2, 1)^t, (5, 3, 2)^t, (0, 0, 0)^t,
(1, 0, 0)^t?


(b) Classify those points.


(c) Assume instead that P(ω_1) = 0.8, and P(ω_2) = P(ω_3) = 0.1 and classify the
test points again.


6. Illustrate the fact that the average of a large number of independent random 
variables will approximate a Gaussian by the following: 


(a) Write a program to generate n random integers from a uniform distribution
U(x_l, x_u). (Some computer systems include this as a single, compiled function
call.)


(b) Now write a routine to choose x_l and x_u randomly, in the range -100 ≤ x_l <
x_u ≤ +100, and n (the number of samples) randomly in the range 0 < n ≤ 1000.


(c) Generate and plot a histogram of the accumulation of 10^4 points sampled as
just described.


(d) Calculate the mean and standard deviation of your histogram, and plot it.


(e) Repeat the above for 10^5 and for 10^6. Discuss your results.


Section 2.8


7. Explore how the empirical error does or does not approach the Bhattacharyya 
bound as follows: 


(a) Write a procedure to generate sample points in d dimensions with a normal
distribution having mean μ and covariance matrix Σ.


(b) Consider p(x|w,) ~ N (OI) and p(x|w2) ~ N (D) with P(w,) = P(w2) = 


1/2. By inspection, state the Bayes decision boundary. 


(c) Generate n = 100 points (50 for ω_1 and 50 for ω_2) and calculate the empirical
error.


(d) Repeat for increasing values of n, 100 < n < 1000, in steps of 100 and plot your 
empirical error. 


(e) Discuss your results. In particular, is it ever possible that the empirical error is 
greater than the Bhattacharyya or Chernoff bound? 


8. Consider two one-dimensional normal distributions p(x|ω_1) ~ N(-.5, 1) and
p(x|ω_2) ~ N(+.5, 1) and P(ω_1) = P(ω_2) = 0.5.


(a) Calculate the Bhattacharyya bound for the error of a Bayesian classifier.


(b) Express the true error rate in terms of an error function, erf(·).


(c) Evaluate this true error to four significant figures by numerical integration (or
other routine).


(d) Generate 10 points each for the two categories and determine the empirical error
using your Bayesian classifier. (You should recalculate the decision boundary
for each of your data sets.)


(e) Plot the empirical error as a function of the number of points from either
distribution by repeating the previous part for 50, 100, 200, 500 and 1000 sample
points from each distribution. Compare your asymptotic empirical error to the
true error and the Bhattacharyya error bound.


9. Repeat Computer exercise 8 with the following conditions:
(a) p(x|ω_1) ~ N(-.5, 2) and p(x|ω_2) ~ N(.5, 2), P(ω_1) = 2/3 and P(ω_2) = 1/3.
(b) p(x|ω_1) ~ N(-.5, 2) and p(x|ω_2) ~ N(.5, 2) and P(ω_1) = P(ω_2) = 1/2.
(c) p(x|ω_1) ~ N(-.5, 3) and p(x|ω_2) ~ N(.5, 1) and P(ω_1) = P(ω_2) = 1/2.





Contents


3 Maximum likelihood and Bayesian estimation

3.1 Introduction
3.2 Maximum Likelihood Estimation
    3.2.1 The General Principle
    3.2.2 The Gaussian Case: Unknown μ
    3.2.3 The Gaussian Case: Unknown μ and Σ
    3.2.4 Bias
3.3 Bayesian estimation
    3.3.1 The Class-Conditional Densities
    3.3.2 The Parameter Distribution
3.4 Bayesian Parameter Estimation: Gaussian Case
    3.4.1 The Univariate Case: p(μ|D)
    3.4.2 The Univariate Case: p(x|D)
    3.4.3 The Multivariate Case
3.5 Bayesian Parameter Estimation: General Theory
    Example 1: Recursive Bayes learning and maximum likelihood
    3.5.1 When do Maximum Likelihood and Bayes methods differ?
    3.5.2 Non-informative Priors and Invariance
3.6 *Sufficient Statistics
    Theorem 3.1: Factorization
    3.6.1 Sufficient Statistics and the Exponential Family
3.7 Problems of Dimensionality
    3.7.1 Accuracy, Dimension, and Training Sample Size
    3.7.2 Computational Complexity
    3.7.3 Overfitting
3.8 *Expectation-Maximization (EM)
    Algorithm 1: Expectation-Maximization
    Example 2: Expectation-Maximization for a 2D normal model
3.9 *Bayesian Belief Networks
    Example 3: Belief network for fish
3.10 *Hidden Markov Models
    3.10.1 First-order Markov models
    3.10.2 First-order hidden Markov models
    3.10.3 Hidden Markov Model Computation
    3.10.4 Evaluation
        Algorithm 2: Forward
        Algorithm 3: Backward
        Example 4: Hidden Markov Model
    3.10.5 Decoding
        Algorithm 4: HMM decode
        Example 5: HMM decoding
    3.10.6 Learning
        Algorithm 5: Forward-Backward
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Index

Chapter 3 


Maximum likelihood and 
Bayesian parameter 
estimation 


3.1 Introduction 


In Chap. ?? we saw how we could design an optimal classifier if we knew the prior
probabilities P(ω_i) and the class-conditional densities p(x|ω_i). Unfortunately, in
pattern recognition applications we rarely if ever have this kind of complete knowledge 
about the probabilistic structure of the problem. In a typical case we merely have 
some vague, general knowledge about the situation, together with a number of design 
samples or training data — particular representatives of the patterns we want to 
classify. The problem, then, is to find some way to use this information to design or 
train the classifier. 

One approach to this problem is to use the samples to estimate the unknown prob- 
abilities and probability densities, and to use the resulting estimates as if they were 
the true values. In typical supervised pattern classification problems, the estimation 
of the prior probabilities presents no serious difficulties (Problem 3). However, es- 
timation of the class-conditional densities is quite another matter. The number of 
available samples always seems too small, and serious problems arise when the di- 
mensionality of the feature vector x is large. If we know the number of parameters in 
advance and our general knowledge about the problem permits us to parameterize the 
conditional densities, then the severity of these problems can be reduced significantly. 
Suppose, for example, that we can reasonably assume that p(x|ω_i) is a normal density
with mean μ_i and covariance matrix Σ_i, although we do not know the exact values
of these quantities. This knowledge simplifies the problem from one of estimating an
unknown function p(x|ω_i) to one of estimating the parameters μ_i and Σ_i.

The problem of parameter estimation is a classical one in statistics, and it can be 
approached in several ways. We shall consider two common and reasonable proce- 
dures, maximum likelihood estimation and Bayesian estimation. Although the results 
obtained with these two procedures are frequently nearly identical, the approaches 
are conceptually quite different. Maximum likelihood and several other methods view
the parameters as quantities whose values are fixed but unknown. The best estimate 
of their value is defined to be the one that maximizes the probability of obtaining 
the samples actually observed. In contrast, Bayesian methods view the parameters as 
random variables having some known a priori distribution. Observation of the sam- 
ples converts this to a posterior density, thereby revising our opinion about the true 
values of the parameters. In the Bayesian case, we shall see that a typical effect of 
observing additional samples is to sharpen the a posteriori density function, causing 
it to peak near the true values of the parameters. This phenomenon is known as 
Bayesian learning. In either case, we use the posterior densities for our classification 
rule, as we have seen before. 

It is important to distinguish between supervised learning and unsupervised learning.
In both cases, samples x are assumed to be obtained by selecting a state of nature
ω_i with probability P(ω_i), and then independently selecting x according to the
probability law p(x|ω_i). The distinction is that with supervised learning we know the state
of nature (class label) for each sample, whereas with unsupervised learning we do not. 
As one would expect, the problem of unsupervised learning is the more difficult one. 
In this chapter we shall consider only the supervised case, deferring consideration of 
unsupervised learning to Chap. ??. 
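
The sampling process just described can be sketched in a few lines of Python. This is only an illustration under assumed values: the two-class priors, the one-dimensional Gaussian class-conditional densities, and the helper name `sample_labeled` are all hypothetical and do not come from the text.

```python
import numpy as np

def sample_labeled(n, priors, means, sigmas, rng):
    """Draw n pairs (label, x): pick class i with probability P(w_i),
    then draw x from an assumed 1-D Gaussian p(x|w_i)."""
    labels = rng.choice(len(priors), size=n, p=priors)
    x = rng.normal(loc=np.asarray(means)[labels],
                   scale=np.asarray(sigmas)[labels])
    return labels, x

rng = np.random.default_rng(0)
priors = [0.7, 0.3]                      # hypothetical P(w_1), P(w_2)
labels, x = sample_labeled(1000, priors, means=[0.0, 3.0],
                           sigmas=[1.0, 1.0], rng=rng)

# Supervised learning sees (labels, x); unsupervised learning sees only x.
# With labels available, the priors can be estimated by simple counting.
prior_estimates = np.bincount(labels, minlength=len(priors)) / len(labels)
print(prior_estimates)
```

With the labels in hand, estimating the priors reduces to counting class frequencies, which is why that part of the problem presents no serious difficulty in the supervised case.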


3.2 Maximum Likelihood Estimation 


Maximum likelihood estimation methods have a number of attractive attributes. 
First, they nearly always have good convergence properties as the number of train- 
ing samples increases. Further, maximum likelihood estimation often can be simpler 
than alternate methods, such as Bayesian techniques or other methods presented in 
subsequent chapters. 


3.2.1 The General Principle 


Suppose that we separate a collection of samples according to class, so that we have c
sets, D_1, ..., D_c, with the samples in D_j having been drawn independently according to
the probability law p(x|ω_j). We say such samples are i.i.d. (independent and identically
distributed) random variables. We assume that p(x|ω_j) has a known parametric form,
and is therefore determined uniquely by the value of a parameter vector θ_j. For
example, we might have p(x|ω_j) ~ N(μ_j, Σ_j), where θ_j consists of the components of
μ_j and Σ_j. To show the dependence of p(x|ω_j) on θ_j explicitly, we write p(x|ω_j) as
p(x|ω_j, θ_j). Our problem is to use the information provided by the training samples
to obtain good estimates for the unknown parameter vectors θ_1, ..., θ_c associated with
each category.

To simplify treatment of this problem, we shall assume that samples in D_i give no
information about θ_j if i ≠ j; that is, we shall assume that the parameters for the
different classes are functionally independent. This permits us to work with each class
separately, and to simplify our notation by deleting indications of class distinctions.
With this assumption we thus have c separate problems of the following form: Use a
set D of training samples drawn independently from the probability density p(x|θ) to
estimate the unknown parameter vector θ.

Suppose that D contains n samples, x_1, ..., x_n. Then, since the samples were drawn
independently, we have

$$ p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta}). \tag{1} $$

Recall from Chap. ?? that, viewed as a function of θ, p(D|θ) is called the likelihood
of θ with respect to the set of samples. The maximum likelihood estimate of θ is, by
definition, the value θ̂ that maximizes p(D|θ). Intuitively, this estimate corresponds
to the value of θ that in some sense best agrees with or supports the actually observed
training samples (Fig. 3.1).
Figure 3.1: The top graph shows several training points in one dimension, known or
assumed to be drawn from a Gaussian of a particular variance, but unknown mean.
Four of the infinite number of candidate source distributions are shown in dashed
lines. The middle figure shows the likelihood p(D|θ) as a function of the mean. If
we had a very large number of training points, this likelihood would be very narrow.
The value that maximizes the likelihood is marked θ̂; it also maximizes the logarithm
of the likelihood, i.e., the log-likelihood l(θ), shown at the bottom. Note especially
that the likelihood lies in a different space from p(x|θ), and the two can have different
functional forms.
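
The situation in Fig. 3.1 can be reproduced numerically. The following is a minimal sketch, assuming a one-dimensional Gaussian with known unit variance and a synthetic training set; the sample size, the grid of candidate means, and the function name `log_likelihood` are illustrative choices rather than anything specified in the text.

```python
import numpy as np

def log_likelihood(theta, samples, sigma=1.0):
    """l(theta) = sum_k ln p(x_k|theta) for a 1-D Gaussian with known sigma."""
    theta = np.atleast_1d(theta)[:, None]        # shape (m, 1) to broadcast over samples
    z = (samples[None, :] - theta) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * z**2, axis=1)

rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=1.0, size=25)   # hypothetical training set D

grid = np.linspace(-2.0, 6.0, 8001)                 # candidate values of the mean
ll = log_likelihood(grid, samples)                  # log-likelihood l(theta)
likelihood = np.exp(ll)                             # likelihood p(D|theta), as in Eq. 1

theta_hat = grid[np.argmax(ll)]                     # grid approximation to the maximizer
print(theta_hat, samples.mean())
```

Both curves peak at the same grid point because the logarithm is monotonic, and for this model the peak sits at the sample mean up to the grid resolution.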


For analytical purposes, it is usually easier to work with the logarithm of the
likelihood than with the likelihood itself. Since the logarithm is monotonically increasing,
the θ̂ that maximizes the log-likelihood also maximizes the likelihood. If p(D|θ) is a
well behaved, differentiable function of θ, θ̂ can be found by the standard methods of
differential calculus. If the number of parameters to be set is p, then we let θ denote
the p-component vector θ = (θ_1, ..., θ_p)^t, and ∇_θ be the gradient operator

$$ \nabla_{\boldsymbol{\theta}} \equiv \begin{bmatrix} \partial/\partial\theta_1 \\ \vdots \\ \partial/\partial\theta_p \end{bmatrix}. \tag{2} $$

We define l(θ) as the log-likelihood function*

$$ l(\boldsymbol{\theta}) \equiv \ln p(\mathcal{D}|\boldsymbol{\theta}). \tag{3} $$

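We can then write our solution formally as the argument θ that maximizes the
log-likelihood, i.e.,

$$ \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \, l(\boldsymbol{\theta}), \tag{4} $$

where the dependence on the data set D is implicit. Thus we have from Eq. 1

$$ l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) \tag{5} $$

and

$$ \nabla_{\boldsymbol{\theta}} l = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k|\boldsymbol{\theta}). \tag{6} $$

Thus, a set of necessary conditions for the maximum likelihood estimate for θ can be
obtained from the set of p equations

$$ \nabla_{\boldsymbol{\theta}} l = \mathbf{0}. \tag{7} $$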


A solution θ̂ to Eq. 7 could represent a true global maximum, a local maximum or
minimum, or (rarely) an inflection point of l(θ). One must be careful, too, to check
if the extremum occurs at a boundary of the parameter space, which might not be
apparent from the solution to Eq. 7. If all solutions are found, we are guaranteed
that one represents the true maximum, though we might have to check each solution
individually (or calculate second derivatives) to identify which is the global optimum.
Of course, we must bear in mind that θ̂ is an estimate; it is only in the limit of an
infinitely large number of training points that we can expect that our estimate will
equal the true value of the generating function (Sec. 3.5.1).
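
To see concretely that Eq. 7 can have several solutions of different kinds, consider the following sketch. It assumes a unit-scale Cauchy density centred at θ and two hypothetical observations at -4 and +4; this model is chosen purely for illustration and is not discussed in the text. For this case the stationary points can be found in closed form (θ = 0 and θ = ±√15), which gives a check on the numerical search.

```python
import numpy as np

samples = np.array([-4.0, 4.0])     # two hypothetical observations

def grad_log_likelihood(theta):
    """d l / d theta for ln p(x|theta) = -ln(pi) - ln(1 + (x - theta)^2)."""
    r = samples - theta
    return float(np.sum(2.0 * r / (1.0 + r**2)))

def second_derivative(theta):
    """d^2 l / d theta^2, used to classify each stationary point."""
    r = samples - theta
    return float(np.sum(2.0 * (r**2 - 1.0) / (1.0 + r**2) ** 2))

def bisect(f, lo, hi, iters=60):
    """Shrink the bracket [lo, hi] around a sign change of f."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Scan a grid for sign changes of the gradient, then refine each bracket.
grid = np.linspace(-10.0, 10.0, 2000)
g = np.array([grad_log_likelihood(t) for t in grid])
brackets = [(grid[i], grid[i + 1]) for i in range(len(grid) - 1)
            if g[i] * g[i + 1] < 0]
stationary = [bisect(grad_log_likelihood, lo, hi) for lo, hi in brackets]
kinds = ["maximum" if second_derivative(t) < 0 else "minimum"
         for t in stationary]
print(list(zip(stationary, kinds)))
```

Here the gradient vanishes at three points: two maxima and, between them, a local minimum. Solving Eq. 7 alone would not tell them apart, so the second-derivative check (or a direct comparison of l(θ) at each solution) is needed.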

We note in passing that a related class of estimators, maximum a posteriori or
MAP estimators, find the value of θ that maximizes p(D|θ)p(θ), or equivalently
l(θ) + ln p(θ). Thus a maximum likelihood estimator is a MAP estimator for the
uniform or "flat" prior. As such, a MAP estimator finds the peak, or mode, of a
posterior density. The drawback of MAP estimators is that if we choose some arbitrary
nonlinear transformation of the parameter space (e.g., an overall rotation), the density
will change, and our MAP solution need no longer be appropriate (Sec. 3.5.2).
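
A small numerical sketch may help to show how a MAP estimate relates to the maximum likelihood estimate. It assumes a one-dimensional Gaussian with known variance and a Gaussian prior on the mean; the prior parameters `mu0` and `sigma0_2` and the function names are hypothetical, introduced only for this illustration. Setting the derivative of the log of likelihood times prior to zero gives the closed form used in `map_mean`.

```python
import numpy as np

def ml_mean(samples):
    """Maximum likelihood estimate of the mean: the sample mean."""
    return float(np.mean(samples))

def map_mean(samples, sigma2, mu0, sigma0_2):
    """Mode of p(D|mu) p(mu) for x ~ N(mu, sigma2) and an assumed prior
    mu ~ N(mu0, sigma0_2); obtained by setting the derivative of
    l(mu) + ln p(mu) to zero."""
    n = len(samples)
    return float((sigma0_2 * np.sum(samples) + sigma2 * mu0)
                 / (n * sigma0_2 + sigma2))

rng = np.random.default_rng(2)
samples = rng.normal(loc=1.0, scale=1.0, size=10)   # hypothetical data

mu_ml = ml_mean(samples)
mu_map = map_mean(samples, sigma2=1.0, mu0=0.0, sigma0_2=0.5)
mu_map_flat = map_mean(samples, sigma2=1.0, mu0=0.0, sigma0_2=1e12)
print(mu_ml, mu_map, mu_map_flat)
```

The MAP estimate is pulled from the sample mean toward the prior mean, and as the prior variance grows (an increasingly flat prior) it approaches the maximum likelihood estimate.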


* Of course, the base of the logarithm can be chosen for convenience, and in most analytic problems
base e is most natural. For that reason we will generally use ln rather than log or $\log_2$.
3.2.2 The Gaussian Case: Unknown $\mu$


To see how maximum likelihood methods apply to a specific case, suppose
that the samples are drawn from a multivariate normal population with mean $\mu$ and
covariance matrix $\Sigma$. For simplicity, consider first the case where only the mean is
unknown. Under this condition, we consider a sample point $x_k$ and find

$$\ln p(x_k|\mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1} (x_k - \mu) \tag{8}$$


and 


$$\nabla_\mu \ln p(x_k|\mu) = \Sigma^{-1}(x_k - \mu). \tag{9}$$


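Identifying $\theta$ with $\mu$, we see from Eqs. 6, 7 & 9 that the maximum likelihood estimate for $\mu$
must satisfy

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0, \tag{10}$$

that is, each of the d components of the sum must vanish. Multiplying by $\Sigma$ and rearranging,
we obtain

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k. \tag{11}$$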


This is a very satisfying result. It says that the maximum likelihood estimate for
the unknown population mean is just the arithmetic average of the training samples,
the sample mean, sometimes written $\hat{\mu}_n$ to clarify its dependence on the number
of samples. Geometrically, if we think of the n samples as a cloud of points, the
sample mean is the centroid of the cloud. The sample mean has a number of desirable 
statistical properties as well, and one would be inclined to use this rather obvious 
estimate even without knowing that it is the maximum likelihood solution. 
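
As a quick numerical check of this result, the following sketch treats the one-dimensional case with a known variance. It evaluates the gradient of Eq. 6 (using Eq. 9) and climbs it by plain gradient ascent. The data values, the step size `rate` and the function names are illustrative assumptions, not part of the derivation above.

```python
def grad_log_likelihood(mu, samples, sigma2):
    """Eq. 6 with Eq. 9 in one dimension: sum over k of (x_k - mu) / sigma2."""
    return sum((x - mu) / sigma2 for x in samples)


def gradient_ascent(samples, sigma2, mu_init=0.0, rate=0.1, steps=200):
    """Follow the gradient of the log-likelihood from mu_init."""
    mu = mu_init
    for _ in range(steps):
        mu += rate * grad_log_likelihood(mu, samples, sigma2)
    return mu


samples = [2.1, 1.9, 3.2, 2.8]          # hypothetical training data
sample_mean = sum(samples) / len(samples)
mu_star = gradient_ascent(samples, sigma2=1.0)

# The gradient vanishes at the sample mean, and the ascent converges to it.
print(sample_mean, mu_star, grad_log_likelihood(sample_mean, samples, 1.0))
```

With these settings each step shrinks the distance to the sample mean by a constant factor, so the iterate and the closed-form estimate of Eq. 11 agree to within rounding.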


3.2.3 The Gaussian Case: Unknown $\mu$ and $\Sigma$


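In the more general (and more typical) multivariate normal case, neither the mean $\mu$
nor the covariance matrix $\Sigma$ is known. Thus, these unknown parameters constitute
the components of the parameter vector $\theta$. Consider first the univariate case with
$\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Here the log-likelihood of a single point is

$$\ln p(x_k|\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2 \tag{12}$$

and its derivative is

$$\nabla_\theta l = \nabla_\theta \ln p(x_k|\theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[1ex] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}. \tag{13}$$

Applying Eq. 7 to the full log-likelihood leads to the conditions

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}(x_k - \hat{\theta}_1) = 0 \tag{14}$$

and

$$-\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0, \tag{15}$$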


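where $\hat{\theta}_1$ and $\hat{\theta}_2$ are the maximum likelihood estimates for $\theta_1$ and $\theta_2$, respectively. By
substituting $\hat{\mu} = \hat{\theta}_1$, $\hat{\sigma}^2 = \hat{\theta}_2$ and doing a little rearranging, we obtain the following
maximum likelihood estimates for $\mu$ and $\sigma^2$:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \tag{16}$$

and

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2. \tag{17}$$
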
While the analysis of the multivariate case is basically very similar, considerably
more manipulations are involved (Problem 6). Just as we would predict, though, the
result is that the maximum likelihood estimates for $\mu$ and $\Sigma$ are given by

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \tag{18}$$

and

$$\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t. \tag{19}$$

Thus, once again we find that the maximum likelihood estimate for the mean
vector is the sample mean. The maximum likelihood estimate for the covariance
matrix is the arithmetic average of the n matrices $(x_k - \hat{\mu})(x_k - \hat{\mu})^t$. Since the true
covariance matrix is the expected value of the matrix $(x - \mu)(x - \mu)^t$, this is also a
very satisfying result.
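
The two estimates can be computed directly from their definitions. The sketch below is a minimal pure-Python illustration of the sample mean and the averaged outer products of Eqs. 18 and 19; the function name and the four data points are hypothetical and chosen only for the example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def ml_gaussian_estimates(samples: Sequence[Sequence[float]]) -> Tuple[Vector, Matrix]:
    """Return the ML mean (Eq. 18) and ML covariance (Eq. 19) of the samples."""
    n = len(samples)
    d = len(samples[0])
    # Eq. 18: the sample mean, component by component.
    mu_hat = [sum(x[i] for x in samples) / n for i in range(d)]
    # Eq. 19: average of the n outer products (x_k - mu_hat)(x_k - mu_hat)^t.
    sigma_hat = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [x[i] - mu_hat[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                sigma_hat[i][j] += diff[i] * diff[j] / n
    return mu_hat, sigma_hat


# Hypothetical toy data: four points in the plane.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 8.0], [7.0, 6.0]]
mu_hat, sigma_hat = ml_gaussian_estimates(data)
print(mu_hat)      # centroid of the cloud of points
print(sigma_hat)   # divides by n, not by n - 1
```

Note that the covariance here divides by n, which is what distinguishes the maximum likelihood estimate from the sample covariance matrix discussed in the next section.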


3.2.4 Bias 


The maximum likelihood estimate for the variance $\sigma^2$ is biased; that is, the expected
value over all data sets of size n of the sample variance is not equal to the true
variance:*

$$\mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2. \tag{20}$$

We shall return to a more general consideration of bias in Chap. ??, but for the
moment we can verify Eq. 20 for an underlying distribution with non-zero variance,
$\sigma^2$, in the extreme case of n = 1, in which the expectation value $\mathcal{E}[\cdot] = 0 \neq \sigma^2$. The
maximum likelihood estimate of the covariance matrix is similarly biased.
Elementary unbiased estimators for $\sigma^2$ and $\Sigma$ are given by

$$\mathcal{E}\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \sigma^2 \tag{21}$$

and

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t, \tag{22}$$

where C is the so-called sample covariance matrix, as explored in Problem 33. If
an estimator is unbiased for all distributions, as for example the variance estimator
in Eq. 21, then it is called absolutely unbiased. If the estimator tends to become
unbiased as the number of samples becomes very large, as for instance Eq. 20, then
the estimator is asymptotically unbiased. In many pattern recognition problems with
large training data sets, asymptotically unbiased estimators are acceptable.

* There should be no confusion over this use of the statistical term bias, and that for an offset in
neural networks and many other places.
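
The bias can also be seen empirically. The following sketch averages both variance estimators over many simulated data sets of size n = 5; the Gaussian source, its standard deviation, the number of trials and the seed are all assumptions made for the illustration.

```python
import random
from statistics import mean


def ml_variance(xs):
    """ML variance estimate: divides by n (the estimator inside Eq. 20)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def unbiased_variance(xs):
    """Divides by n - 1 (Eq. 21); needs at least two samples."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def average_estimates(n, trials=20000, sigma=2.0, seed=0):
    """Average each estimator over many independent data sets of size n."""
    rng = random.Random(seed)
    ml_total, unbiased_total = 0.0, 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        ml_total += ml_variance(xs)
        unbiased_total += unbiased_variance(xs)
    return ml_total / trials, unbiased_total / trials


n = 5
ml_avg, unbiased_avg = average_estimates(n)
# True variance is sigma**2 = 4; Eq. 20 predicts (n - 1) / n * 4 for the ML average.
print(ml_avg, unbiased_avg)
```

On every individual data set the two estimates differ by exactly the factor (n - 1)/n, so their averages do as well; only the unbiased average should settle near the true variance.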

Clearly, $\hat{\Sigma} = [(n-1)/n]C$, and $\hat{\Sigma}$ is asymptotically unbiased; these two estimates
are essentially identical when n is large. However, the existence of two similar but 
nevertheless distinct estimates for the covariance matrix may be disconcerting, and it 
is natural to ask which one is “correct.” Of course, for n > 1 the answer is that these 
estimates are neither right nor wrong — they are just different. What the existence of 
two actually shows is that no single estimate possesses all of the properties we might 
desire. For our purposes, the most desirable property is rather complex — we want 
the estimate that leads to the best classification performance. While it is usually both 
reasonable and sound to design a classifier by substituting the maximum likelihood 
estimates for the unknown parameters, we might well wonder if other estimates might 
not lead to better performance. Below we address this question from a Bayesian 
viewpoint. 

If we have a reliable model for the underlying distributions and their dependence
upon the parameter vector $\theta$, the maximum likelihood classifier will give excellent
results. But what if our model is wrong? Do we nevertheless get the best classifier in
our assumed set of models? For instance, what if we assume that a distribution comes
from $N(\mu, 1)$ but instead it actually comes from $N(\mu, 10)$? Will the value we find for
$\theta = \mu$ by maximum likelihood yield the best of all classifiers of the form derived from
$N(\mu, 1)$? Unfortunately, the answer is “no,” and an illustrative counterexample is
given in Problem 7 where the so-called model error is large indeed. This points out 
the need for reliable information concerning the models — if the assumed model is 
very poor, we cannot be assured that the classifier we derive is the best, even among 
our model set. We shall return to the problem of choosing among candidate models 
in Chap. ??. 


3.3 Bayesian estimation 


We now consider the Bayesian estimation or Bayesian learning approach to pattern 
classification problems. Although the answers we get by this method will generally 
be nearly identical to those obtained by maximum likelihood, there is a conceptual 
difference: whereas in maximum likelihood methods we view the true parameter vector 
we seek, $\theta$, to be fixed, in Bayesian learning we consider $\theta$ to be a random variable,
and training data allows us to convert a distribution on this variable into a posterior 
probability density. 
3.3.1 The Class-Conditional Densities


The computation of the posterior probabilities $P(\omega_i|x)$ lies at the heart of Bayesian
classification. Bayes’ formula allows us to compute these probabilities from the prior
probabilities $P(\omega_i)$ and the class-conditional densities $p(x|\omega_i)$, but how can we proceed
when these quantities are unknown? The general answer to this question is that the
best we can do is to compute $P(\omega_i|x)$ using all of the information at our disposal.
Part of this information might be prior knowledge, such as knowledge of the functional
forms for unknown densities and ranges for the values of unknown parameters. Part
of this information might reside in a set of training samples. If we again let D denote
the set of samples, then we can emphasize the role of the samples by saying that our
goal is to compute the posterior probabilities $P(\omega_i|x, D)$. From these probabilities we
can obtain the Bayes classifier.
Given the sample D, Bayes’ formula then becomes

$$P(\omega_i|x, D) = \frac{p(x|\omega_i, D)\,P(\omega_i|D)}{\sum_{j=1}^{c} p(x|\omega_j, D)\,P(\omega_j|D)}. \tag{23}$$


As this equation suggests, we can use the information provided by the training samples 
to help determine both the class-conditional densities and the a priori probabilities. 

Although we could maintain this generality, we shall henceforth assume that the
true values of the a priori probabilities are known or obtainable from a trivial calculation;
thus we substitute $P(\omega_i) = P(\omega_i|D)$. Furthermore, since we are treating the
supervised case, we can separate the training samples by class into c subsets $D_1, \ldots, D_c$,
with the samples in $D_i$ belonging to $\omega_i$. As we mentioned when addressing maximum
likelihood methods, in most cases of interest (and in all of the cases we shall consider),
the samples in $D_i$ have no influence on $p(x|\omega_j, D)$ if $i \neq j$. This has two simplifying
consequences. First, it allows us to work with each class separately, using only the
samples in $D_i$ to determine $p(x|\omega_i, D)$. Used in conjunction with our assumption that
the prior probabilities are known, this allows us to write Eq. 23 as

$$P(\omega_i|x, D) = \frac{p(x|\omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j, D_j)\,P(\omega_j)}. \tag{24}$$


Second, because each class can be treated independently, we can dispense with need- 
less class distinctions and simplify our notation. In essence, we have c separate prob- 
lems of the following form: use a set D of samples drawn independently according to 
the fixed but unknown probability distribution p(x) to determine p(x|D). This is the 
central problem of Bayesian learning. 


3.3.2 The Parameter Distribution 


Although the desired probability density p(x) is unknown, we assume that it has a
known parametric form. The only thing assumed unknown is the value of a parameter
vector $\theta$. We shall express the fact that p(x) is unknown but has known parametric
form by saying that the function $p(x|\theta)$ is completely known. Any information we
might have about $\theta$ prior to observing the samples is assumed to be contained in a
known prior density $p(\theta)$. Observation of the samples converts this to a posterior
density $p(\theta|D)$, which, we hope, is sharply peaked about the true value of $\theta$.
Note that we are changing our supervised learning problem into an unsupervised
density estimation problem. To this end, our basic goal is to compute p(x|D), which
is as close as we can come to obtaining the unknown p(x). We do this by integrating
the joint density $p(x, \theta|D)$ over $\theta$. That is,

$$p(x|D) = \int p(x, \theta|D)\, d\theta, \tag{25}$$


where the integration extends over the entire parameter space. Now as discussed in
Problem 12 we can write $p(x, \theta|D)$ as the product $p(x|\theta, D)\,p(\theta|D)$. Since the selection
of x and that of the training samples in D is done independently, the first factor is
merely $p(x|\theta)$. That is, the distribution of x is known completely once we know the
value of the parameter vector. Thus, Eq. 25 can be rewritten as

$$p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta. \tag{26}$$


This key equation links the desired class-conditional density p(x|D) to the posterior
density $p(\theta|D)$ for the unknown parameter vector. If $p(\theta|D)$ peaks very sharply
about some value $\hat{\theta}$, we obtain $p(x|D) \approx p(x|\hat{\theta})$, i.e., the result we would obtain by
substituting the estimate $\hat{\theta}$ for the true parameter vector. This result rests on the
assumption that $p(x|\theta)$ is smooth, and that the tails of the integral are not important.
These conditions are typically but not invariably the case, as we shall see in Sect. ??.
In general, if we are less certain about the exact value of $\theta$, this equation directs us to
average $p(x|\theta)$ over the possible values of $\theta$. Thus, when the unknown densities have
a known parametric form, the samples exert their influence on p(x|D) through the
posterior density $p(\theta|D)$. We should also point out that in practice, the integration
in Eq. 26 is often performed numerically, for instance by Monte Carlo simulation.
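
To make the Monte Carlo remark concrete, here is a minimal sketch that approximates the integral of Eq. 26 by averaging the likelihood of x over draws from the posterior. It assumes, purely for illustration, the Gaussian setting of Sec. 3.4, in which both the likelihood and the posterior over the mean are normal and the integral has a known closed form to compare against. All numerical values and function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_monte_carlo(x, sample_posterior, likelihood, draws=50000):
    """Approximate Eq. 26 by averaging p(x | theta) over posterior draws."""
    return sum(likelihood(x, sample_posterior()) for _ in range(draws)) / draws


# Assumed setup: p(x | mu) = N(mu, sigma2) with sigma2 known, and a Gaussian
# posterior p(mu | D) = N(mu_n, sigma_n2). These numbers are made up.
sigma2, mu_n, sigma_n2 = 1.0, 0.5, 0.25
rng = random.Random(0)

estimate = predictive_monte_carlo(
    x=1.0,
    sample_posterior=lambda: rng.gauss(mu_n, math.sqrt(sigma_n2)),
    likelihood=lambda x, mu: normal_pdf(x, mu, sigma2),
)
exact = normal_pdf(1.0, mu_n, sigma2 + sigma_n2)  # closed form for this special case
print(estimate, exact)
```

The same averaging applies when no closed form exists, provided we can draw samples from the posterior; the accuracy then improves roughly as the square root of the number of draws.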


3.4 Bayesian Parameter Estimation: Gaussian Case 


In this section we use Bayesian estimation techniques to calculate the a posteriori
density $p(\theta|D)$ and the desired probability density p(x|D) for the case where

$$p(x|\mu) \sim N(\mu, \Sigma).$$


3.4.1 The Univariate Case: $p(\mu|D)$


Consider the case where $\mu$ is the only unknown parameter. For simplicity we treat
first the univariate case, i.e.,

$$p(x|\mu) \sim N(\mu, \sigma^2), \tag{27}$$

where the only unknown quantity is the mean $\mu$. We assume that whatever prior
knowledge we might have about $\mu$ can be expressed by a known prior density $p(\mu)$.
Later we shall make the further assumption that

$$p(\mu) \sim N(\mu_0, \sigma_0^2), \tag{28}$$

where both $\mu_0$ and $\sigma_0^2$ are known. Roughly speaking, $\mu_0$ represents our best a priori
guess for $\mu$, and $\sigma_0^2$ measures our uncertainty about this guess. The assumption
that the prior distribution for $\mu$ is normal will simplify the subsequent mathematics.
However, the crucial assumption is not so much that the prior distribution for $\mu$ is
normal, but that it is known.

Having selected the a priori density for $\mu$, we can view the situation as follows.
Imagine that a value is drawn for $\mu$ from a population governed by the probability
law $p(\mu)$. Once this value is drawn, it becomes the true value of $\mu$ and completely
determines the density for x. Suppose now that n samples $x_1, \ldots, x_n$ are independently
drawn from the resulting population. Letting $D = \{x_1, \ldots, x_n\}$, we use Bayes’ formula
to obtain

$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu), \tag{29}$$

where $\alpha$ is a normalization factor that depends on D but is independent of $\mu$. This
equation shows how the observation of a set of training samples affects our ideas about
the true value of $\mu$; it relates the prior density $p(\mu)$ to an a posteriori density $p(\mu|D)$.
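Since $p(x_k|\mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma_0^2)$, we have

$$\begin{aligned}
p(\mu|D) &= \alpha \prod_{k=1}^{n} \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]}_{p(x_k|\mu)} \cdot \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]}_{p(\mu)} \\
&= \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu-x_k}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right] \\
&= \alpha'' \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right],
\end{aligned} \tag{30}$$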


where factors that do not depend on $\mu$ have been absorbed into the constants $\alpha$,
$\alpha'$, and $\alpha''$. Thus, $p(\mu|D)$ is an exponential function of a quadratic function of $\mu$,
i.e., is again a normal density. Since this is true for any number of training samples,
$p(\mu|D)$ remains normal as the number n of samples is increased, and $p(\mu|D)$ is said
to be a reproducing density and $p(\mu)$ is said to be a conjugate prior. If we write
$p(\mu|D) \sim N(\mu_n, \sigma_n^2)$, then $\mu_n$ and $\sigma_n^2$ can be found by equating coefficients in Eq. 30
with corresponding coefficients in the generic Gaussian of the form

$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]. \tag{31}$$


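Identifying coefficients in this way yields

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \tag{32}$$

and

$$\frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2}, \tag{33}$$

where $\bar{x}_n$ is the sample mean

$$\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k. \tag{34}$$
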
We solve explicitly for $\mu_n$ and $\sigma_n^2$ and obtain

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0 \tag{35}$$

and

$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}. \tag{36}$$


These equations show how the prior information is combined with the empirical
information in the samples to obtain the a posteriori density $p(\mu|D)$. Roughly speaking,
$\mu_n$ represents our best guess for $\mu$ after observing n samples, and $\sigma_n^2$ measures
our uncertainty about this guess. Since $\sigma_n^2$ decreases monotonically with n,
approaching $\sigma^2/n$ as n approaches infinity, each additional observation decreases our
uncertainty about the true value of $\mu$. As n increases, $p(\mu|D)$ becomes more and
more sharply peaked, approaching a Dirac delta function as n approaches infinity.
This behavior is commonly known as Bayesian learning (Fig. 3.2).


Figure 3.2: Bayesian learning of the mean of normal distributions in one and two di- 
mensions. The posterior distribution estimates are labelled by the number of training 
samples used in the estimation. 


In general, $\mu_n$ is a linear combination of $\bar{x}_n$ and $\mu_0$, with coefficients that are
non-negative and sum to one. Thus $\mu_n$ always lies somewhere between $\bar{x}_n$ and $\mu_0$. If
$\sigma_0 \neq 0$, $\mu_n$ approaches the sample mean as n approaches infinity. If $\sigma_0 = 0$, we have
a degenerate case in which our a priori certainty that $\mu = \mu_0$ is so strong that no
number of observations can change our opinion. At the other extreme, if $\sigma_0 \gg \sigma$, we
are so uncertain about our a priori guess that we take $\mu_n = \bar{x}_n$, using only the samples
to estimate $\mu$. In general, the relative balance between prior knowledge and empirical
data is set by the ratio of $\sigma^2$ to $\sigma_0^2$, which is sometimes called the dogmatism. If the
dogmatism is not infinite, after enough samples are taken the exact values assumed
for $\mu_0$ and $\sigma_0^2$ will be unimportant, and $\mu_n$ will converge to the sample mean.
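
The behavior described above is easy to tabulate. The sketch below evaluates the posterior mean and variance of Eqs. 35 and 36 as more samples are used; the function name, the prior N(0, 1), the known variance of 4 and the four samples are assumptions chosen only to make the arithmetic transparent.

```python
def posterior_of_mean(samples, sigma2, mu0, sigma0_2):
    """Return (mu_n, sigma_n2) from Eqs. 35 and 36 for known variance sigma2."""
    n = len(samples)
    xbar = sum(samples) / n                      # sample mean, Eq. 34
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * xbar + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2


# Hypothetical numbers: known sigma^2 = 4, prior N(0, 1).
samples = [1.0, 2.0, 3.0, 2.0]
for n in (1, 2, 4):
    print(n, posterior_of_mean(samples[:n], sigma2=4.0, mu0=0.0, sigma0_2=1.0))
```

As n grows, the posterior mean moves from the prior guess toward the sample mean while the posterior variance shrinks, which is the sharpening of the posterior that the text calls Bayesian learning.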
3.4.2 The Univariate Case: p(x|D)


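Having obtained the a posteriori density for the mean, $p(\mu|D)$, all that remains is to
obtain the “class-conditional” density p(x|D).* From Eqs. 26, 27 & 31 we have

$$\begin{aligned}
p(x|D) &= \int p(x|\mu)\,p(\mu|D)\, d\mu \\
&= \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right] d\mu \\
&= \frac{1}{2\pi\sigma\sigma_n}\exp\left[-\frac{1}{2}\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right] f(\sigma,\sigma_n),
\end{aligned} \tag{37}$$

where

$$f(\sigma,\sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right] d\mu.$$

That is, as a function of x, p(x|D) is proportional to $\exp[-(1/2)(x-\mu_n)^2/(\sigma^2+\sigma_n^2)]$,
and hence p(x|D) is normally distributed with mean $\mu_n$ and variance $\sigma^2+\sigma_n^2$:

$$p(x|D) \sim N(\mu_n, \sigma^2+\sigma_n^2). \tag{38}$$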


In other words, to obtain the class-conditional density p(x|D), whose parametric
form is known to be $p(x|\mu) \sim N(\mu, \sigma^2)$, we merely replace $\mu$ by $\mu_n$ and $\sigma^2$ by $\sigma^2+\sigma_n^2$.
In effect, the conditional mean $\mu_n$ is treated as if it were the true mean, and the
known variance is increased to account for the additional uncertainty in x resulting
from our lack of exact knowledge of the mean $\mu$. This, then, is our final result:
the density p(x|D) is the desired class-conditional density $p(x|\omega_j, D_j)$, and together
with the prior probabilities $P(\omega_j)$ it gives us the probabilistic information needed to
design the classifier. This is in contrast to maximum likelihood methods, which make
only point estimates $\hat{\mu}$ and $\hat{\sigma}^2$ rather than estimating a distribution for p(x|D).
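
The contrast between the two approaches can be illustrated numerically. The sketch below evaluates the Bayesian density of Eq. 38 alongside a plug-in density that substitutes the sample mean into $N(\mu, \sigma^2)$ with the variance taken as known; the function names and all numerical values are hypothetical.

```python
import math


def predictive_density(x, samples, sigma2, mu0, sigma0_2):
    """Evaluate p(x | D) = N(mu_n, sigma2 + sigma_n2) from Eq. 38."""
    n = len(samples)
    xbar = sum(samples) / n
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / denom    # Eq. 35
    sigma_n2 = sigma0_2 * sigma2 / denom                   # Eq. 36
    var = sigma2 + sigma_n2                                # widened variance
    return math.exp(-0.5 * (x - mu_n) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def plug_in_density(x, samples, sigma2):
    """Plug-in alternative: substitute the sample mean into N(mu, sigma2)."""
    xbar = sum(samples) / len(samples)
    return math.exp(-0.5 * (x - xbar) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)


# Hypothetical numbers: known sigma^2 = 4, prior N(0, 1), sample mean 2.
samples = [1.0, 2.0, 3.0, 2.0]
bayes_peak = predictive_density(1.0, samples, sigma2=4.0, mu0=0.0, sigma0_2=1.0)
ml_peak = plug_in_density(2.0, samples, sigma2=4.0)
print(bayes_peak, ml_peak)
```

With these numbers the Bayesian density is centered at $\mu_n = 1$ rather than at the sample mean, and its peak is lower because its variance is larger by $\sigma_n^2$.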


3.4.3 The Multivariate Case


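The treatment of the multivariate case in which $\Sigma$ is known but $\mu$ is not is a direct
generalization of the univariate case. For this reason we shall only sketch the
derivation. As before, we assume that

$$p(x|\mu) \sim N(\mu, \Sigma) \quad\text{and}\quad p(\mu) \sim N(\mu_0, \Sigma_0), \tag{39}$$

where $\Sigma$, $\Sigma_0$, and $\mu_0$ are assumed to be known. After observing a set D of n independent
samples $x_1, \ldots, x_n$, we use Bayes’ formula to obtain

$$\begin{aligned}
p(\mu|D) &= \alpha \prod_{k=1}^{n} p(x_k|\mu)\,p(\mu) \\
&= \alpha' \exp\left[-\frac{1}{2}\left(\mu^t\left(n\Sigma^{-1}+\Sigma_0^{-1}\right)\mu - 2\mu^t\left(\Sigma^{-1}\sum_{k=1}^{n}x_k + \Sigma_0^{-1}\mu_0\right)\right)\right],
\end{aligned} \tag{40}$$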


* Recall that for simplicity we dropped class distinctions, but that all samples here come from the
same class, say $\omega_i$, and hence p(x|D) is really $p(x|\omega_i, D_i)$.
which has the form

$$p(\mu|D) = \alpha'' \exp\left[-\frac{1}{2}(\mu-\mu_n)^t \Sigma_n^{-1} (\mu-\mu_n)\right]. \tag{41}$$

Thus, $p(\mu|D) \sim N(\mu_n, \Sigma_n)$, and once again we have a reproducing density. Equating
coefficients, we obtain the analogs of Eqs. 32 & 33,

$$\Sigma_n^{-1} = n\Sigma^{-1} + \Sigma_0^{-1} \tag{42}$$

and

$$\Sigma_n^{-1}\mu_n = n\Sigma^{-1}\hat{\mu}_n + \Sigma_0^{-1}\mu_0, \tag{43}$$

where $\hat{\mu}_n$ is the sample mean

$$\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k. \tag{44}$$


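The solution of these equations for $\mu_n$ and $\Sigma_n$ is simplified by knowledge of the matrix
identity

$$(A^{-1}+B^{-1})^{-1} = A(A+B)^{-1}B = B(A+B)^{-1}A, \tag{45}$$

which is valid for any pair of nonsingular, d-by-d matrices A and B. After a little
manipulation (Problem 16), we obtain the final results:

$$\mu_n = \Sigma_0\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\hat{\mu}_n + \frac{1}{n}\Sigma\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\mu_0 \tag{46}$$

(which, as in the univariate case, is a linear combination of $\hat{\mu}_n$ and $\mu_0$) and

$$\Sigma_n = \Sigma_0\left(\Sigma_0+\frac{1}{n}\Sigma\right)^{-1}\frac{1}{n}\Sigma. \tag{47}$$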


The proof that $p(x|D) \sim N(\mu_n, \Sigma+\Sigma_n)$ can be obtained as before by performing
the integration

$$p(x|D) = \int p(x|\mu)\,p(\mu|D)\, d\mu. \tag{48}$$

However, this result can be obtained with less effort by observing that x can be viewed
as the sum of two mutually independent random variables, a random vector $\mu$ with
$p(\mu|D) \sim N(\mu_n, \Sigma_n)$ and an independent random vector y with $p(y) \sim N(0, \Sigma)$.
Since the sum of two independent, normally distributed vectors is again a normally
distributed vector whose mean is the sum of the means and whose covariance matrix
is the sum of the covariance matrices (Chap. ?? Problem ??), we have

$$p(x|D) \sim N(\mu_n, \Sigma+\Sigma_n), \tag{49}$$


and the generalization is complete. 
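
For a concrete feel for Eqs. 46 and 47, the sketch below works them out for two-dimensional data using small hand-written matrix helpers, so that no external library is needed. The helper names, the restriction to 2-by-2 matrices, and every numerical value are assumptions made for the illustration.

```python
def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def mat_scale(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def mat_vec(a, v):
    return [sum(a[i][k] * v[k] for k in range(2)) for i in range(2)]


def inv2(a):
    """Inverse of a nonsingular 2-by-2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def posterior_mean_cov(samples, sigma, mu0, sigma0):
    """Return (mu_n, Sigma_n) from Eqs. 46 and 47 for two-dimensional data."""
    n = len(samples)
    mu_hat = [sum(x[i] for x in samples) / n for i in range(2)]   # Eq. 44
    sigma_over_n = mat_scale(1.0 / n, sigma)
    m_inv = inv2(mat_add(sigma0, sigma_over_n))   # (Sigma_0 + Sigma/n)^-1
    a = mat_mul(sigma0, m_inv)                    # weight on the sample mean
    b = mat_mul(sigma_over_n, m_inv)              # weight on the prior mean
    mu_n = [u + v for u, v in zip(mat_vec(a, mu_hat), mat_vec(b, mu0))]
    sigma_n = mat_mul(a, sigma_over_n)
    return mu_n, sigma_n


# Hypothetical example: correlated known covariance, unit-covariance prior at the origin.
samples = [[1.0, 3.0], [3.0, 1.0]]
sigma = [[2.0, 1.0], [1.0, 2.0]]
mu0 = [0.0, 0.0]
sigma0 = [[1.0, 0.0], [0.0, 1.0]]
mu_n, sigma_n = posterior_mean_cov(samples, sigma, mu0, sigma0)
print(mu_n)
print(sigma_n)
```

A useful sanity check on any such implementation is that the inverse of the returned covariance should reproduce Eq. 42, that is, n times the inverse of the known covariance plus the inverse of the prior covariance.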
3.5 Bayesian Parameter Estimation: General Theory


We have just seen how the Bayesian approach can be used to obtain the desired density 
p(x|D) in a special case — the multivariate Gaussian. This approach can be gener- 
alized to apply to any situation in which the unknown density can be parameterized. 
The basic assumptions are summarized as follows: 


• The form of the density $p(x|\theta)$ is assumed to be known, but the value of the
parameter vector $\theta$ is not known exactly.


• Our initial knowledge about $\theta$ is assumed to be contained in a known a priori
density $p(\theta)$.


• The rest of our knowledge about $\theta$ is contained in a set D of n samples $x_1, \ldots, x_n$
drawn independently according to the unknown probability density p(x).


The basic problem is to compute the posterior density $p(\theta|D)$, since from this we
can use Eq. 26 to compute p(x|D):

$$p(x|D) = \int p(x|\theta)\,p(\theta|D)\, d\theta. \tag{50}$$

By Bayes’ formula we have

$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{\int p(D|\theta)\,p(\theta)\, d\theta}, \tag{51}$$

and by the independence assumption

$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta). \tag{52}$$


This constitutes the solution to the problem, and Eqs. 51 & 52 illuminate its
relation to the maximum likelihood solution. Suppose that $p(D|\theta)$ reaches a sharp
peak at $\theta = \hat{\theta}$. If the prior density $p(\theta)$ is not zero at $\theta = \hat{\theta}$ and does not change
much in the surrounding neighborhood, then $p(\theta|D)$ also peaks at that point. Thus,
Eq. 26 shows that p(x|D) will be approximately $p(x|\hat{\theta})$, the result one would obtain
by using the maximum likelihood estimate as if it were the true value. If the peak
of $p(D|\theta)$ is very sharp, then the influence of prior information on the uncertainty in
the true value of $\theta$ can be ignored. In this and even the more general case, though,
the Bayesian solution tells us how to use all the available information to compute the
desired density p(x|D).

While we have obtained the formal Bayesian solution to the problem, a number 
of interesting questions remain. One concerns the difficulty of carrying out these 
computations. Another concerns the convergence of p(x|D) to p(x). We shall discuss 
the matter of convergence briefly, and later turn to the computational question. 

To indicate explicitly the number of samples in a set for a single category, we shall
write $D^n = \{x_1, \ldots, x_n\}$. Then from Eq. 52, if n > 1

$$p(D^n|\theta) = p(x_n|\theta)\,p(D^{n-1}|\theta). \tag{53}$$

Substituting this in Eq. 51 and using Bayes’ formula, we see that the posterior density
satisfies the recursion relation

$$p(\theta|D^n) = \frac{p(x_n|\theta)\,p(\theta|D^{n-1})}{\int p(x_n|\theta)\,p(\theta|D^{n-1})\, d\theta}. \tag{54}$$


With the understanding that $p(\theta|D^0) = p(\theta)$, repeated use of this equation produces
the sequence of densities $p(\theta)$, $p(\theta|x_1)$, $p(\theta|x_1, x_2)$, and so forth. (It should be
obvious from Eq. 54 that $p(\theta|D^n)$ depends only on the points in $D^n$, not the sequence
in which they were selected.) This is called the recursive Bayes approach to parameter
estimation. This is, too, our first example of an incremental or on-line learning
method, where learning goes on as the data is collected. When this sequence of densities
converges to a Dirac delta function centered about the true parameter value, we have
Bayesian learning (Example 1). We shall come across many other, non-incremental
learning schemes, where all the training data must be present before learning can take
place.

In principle, Eq. 54 requires that we preserve all the training points in $D^{n-1}$ in
order to calculate $p(\theta|D^n)$ but for some distributions, just a few parameters associated
with $p(\theta|D^{n-1})$ contain all the information needed. Such parameters are the sufficient
statistics of those distributions, as we shall see in Sect. 3.6. Some authors reserve the 
term recursive learning to apply to only those cases where the sufficient statistics are 
retained — not the training data — when incorporating the information from a new 
training point. We could call this more restrictive usage true recursive Bayes learning. 
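
The Gaussian case of Sec. 3.4.1 gives a simple illustration of this more restrictive usage, since the posterior over the mean is fully described by two numbers. The sketch below applies Eqs. 35 and 36 with n = 1 repeatedly, each time treating the previous posterior as the prior, so that no earlier training point needs to be stored. The function names and numerical values are assumptions made for the example.

```python
def update(mu_prev, var_prev, x, sigma2):
    """One step of Eq. 54 for a Gaussian mean with known variance sigma2.

    The previous posterior N(mu_prev, var_prev) plays the role of the prior,
    so this is Eqs. 35 and 36 applied with a single new sample x.
    """
    var_new = var_prev * sigma2 / (var_prev + sigma2)
    mu_new = (var_prev * x + sigma2 * mu_prev) / (var_prev + sigma2)
    return mu_new, var_new


def recursive_posterior(samples, sigma2, mu0, sigma0_2):
    """Process the samples one at a time, keeping only (mean, variance)."""
    mu, var = mu0, sigma0_2          # p(mu | D^0) is the prior
    for x in samples:
        mu, var = update(mu, var, x, sigma2)
    return mu, var


# Hypothetical numbers: known sigma^2 = 4, prior N(0, 1).
samples = [1.0, 2.0, 3.0, 2.0]
forward = recursive_posterior(samples, 4.0, 0.0, 1.0)
backward = recursive_posterior(list(reversed(samples)), 4.0, 0.0, 1.0)
print(forward, backward)
```

Both orderings give the same posterior, in line with the remark that $p(\theta|D^n)$ depends only on the points in $D^n$ and not on the sequence in which they arrive.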


Example 1: Recursive Bayes learning


Suppose we believe our one-dimensional samples come from a uniform distribution

$$p(x|\theta) \sim U(0,\theta) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases}$$

but initially we know only that our parameter is bounded. In particular we assume
$0 < \theta \le 10$ (a non-informative or “flat prior” we shall discuss in Sect. 3.5.2). We
will use recursive Bayes methods to estimate $\theta$ and the underlying densities from the
data D = {4, 7, 2, 8}, which were selected randomly from the underlying distribution.
Before any data arrive, then, we have $p(\theta|D^0) = p(\theta) = U(0, 10)$. When our first data
point $x_1 = 4$ arrives, we use Eq. 54 to get an improved estimate:

$$p(\theta|D^1) \propto p(x|\theta)\,p(\theta|D^0) = \begin{cases} 1/\theta & \text{for } 4 \le \theta \le 10 \\ 0 & \text{otherwise,} \end{cases}$$


where throughout we will ignore the normalization. When the next data point $x_2 = 7$
arrives, we have

$$p(\theta|D^2) \propto p(x|\theta)\,p(\theta|D^1) = \begin{cases} 1/\theta^2 & \text{for } 7 \le \theta \le 10 \\ 0 & \text{otherwise,} \end{cases}$$


and similarly for the remaining sample points. It should be clear that since each
successive step introduces a factor of $1/\theta$ into the posterior, and the posterior is nonzero
only for $\theta$ values at or above the largest data point sampled, the general form of our solution
is $p(\theta|D^n) \propto 1/\theta^n$ for $\max_x[D^n] \le \theta \le 10$, as shown in the figure. Given our full data
set, the maximum likelihood solution here is clearly $\hat{\theta} = 8$, and this implies a uniform
p(x|D) ~ U(0, 8).

The posterior $p(\theta|D^n)$ for the model and n points in the data set in this Example.
The posterior begins $p(\theta) \sim U(0, 10)$, and as more points are incorporated it becomes
increasingly peaked at the value of the highest data point.

According to our Bayesian methodology, which requires the integration in Eq. 50,
the density is uniform up to x = 8, but has a tail at higher values, an indication
that the influence of our prior $p(\theta)$ has not yet been swamped by the information in
the training data.

[Figure: p(x|D) plotted against x from 0 to 10, with one curve labelled ML and one labelled Bayes.]

Given the full set of four points, the distribution based on the maximum likelihood
solution is $p(x|\hat{\theta}) \sim U(0, 8)$, whereas the distribution derived from Bayesian methods
has a small tail above x = 8, reflecting the prior information that values of x near 10
are possible.


Whereas the maximum likelihood approach estimates a point in $\theta$ space, the
Bayesian approach instead estimates a distribution. Technically speaking, then, we 
cannot directly compare these estimates. It is only when the second stage of inference 
is done — that is, we compute the distributions p(x|D), as shown in the above figure 
— that the comparison is fair. 
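
The computations in this Example can be reproduced with a few lines of numerical integration. The sketch below builds the unnormalized posterior, which is proportional to $\theta^{-n}$ on the interval from the largest data point to 10, and then evaluates the Bayesian density of Eq. 50 at two test points. The midpoint rule, the step count and the function names are implementation choices assumed here, not part of the Example.

```python
def posterior_unnormalized(theta, data, upper=10.0):
    """p(theta | D^n) up to a constant: theta^(-n) on [max(data), upper]."""
    if max(data) <= theta <= upper:
        return theta ** (-len(data))
    return 0.0


def integrate(f, a, b, steps=20000):
    """Midpoint rule; adequate for the smooth integrands used here."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def bayes_predictive(x, data, upper=10.0):
    """Eq. 50: integrate p(x | theta) p(theta | D) over theta."""
    if x < 0.0 or x > upper:
        return 0.0
    lo = max(data)
    z = integrate(lambda t: posterior_unnormalized(t, data, upper), lo, upper)
    start = max(x, lo)  # p(x | theta) = 1/theta only when theta >= x
    num = integrate(lambda t: posterior_unnormalized(t, data, upper) / t, start, upper)
    return num / z


data = [4.0, 7.0, 2.0, 8.0]
theta_ml = max(data)                    # maximum likelihood estimate of theta
ml_height = 1.0 / theta_ml              # height of the ML density on [0, theta_ml]
bayes_flat = bayes_predictive(5.0, data)   # flat part of the Bayes density
bayes_tail = bayes_predictive(9.0, data)   # tail above the largest data point
print(theta_ml, ml_height, bayes_flat, bayes_tail)
```

The Bayesian density is somewhat lower than the maximum likelihood density below the largest data point and remains positive above it, which is the tail described in the text.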


For most of the typically encountered probability densities $p(x|\theta)$, the sequence of
posterior densities does indeed converge to a delta function. Roughly speaking, this
implies that with a large number of samples there is only one value for $\theta$ that causes
$p(x|\theta)$ to fit the data, i.e., that $\theta$ can be determined uniquely from $p(x|\theta)$. When this
is the case, $p(x|\theta)$ is said to be identifiable. A rigorous proof of convergence under
these conditions requires a precise statement of the properties required of $p(x|\theta)$ and
$p(\theta)$ and considerable care, but presents no serious difficulties (Problem 21).

There are occasions, however, when more than one value of $\theta$ may yield the same
value for $p(x|\theta)$. In such cases, $\theta$ cannot be determined uniquely from $p(x|\theta)$, and
$p(\theta|D^n)$ will peak near all of the values of $\theta$ that explain the data. Fortunately, this
ambiguity is erased by the integration in Eq. 26, since $p(x|\theta)$ is the same for all of
these values of $\theta$. Thus, $p(x|D^n)$ will typically converge to p(x) whether or not $p(x|\theta)$
is identifiable. While this might make the problem of identifiability appear to be moot,
we shall see in Chap. ?? that identifiability presents a genuine problem in the case of
unsupervised learning.


3.5.1 When do Maximum Likelihood and Bayes methods differ? 


In virtually every case, maximum likelihood and Bayes solutions are equivalent in the 
asymptotic limit of infinite training data. However, since practical pattern recognition
problems invariably have a limited set of training data, it is natural to ask when 
maximum likelihood and Bayes solutions may be expected to differ, and then which 
we should prefer. 

There are several criteria that will influence our choice. One is computational
complexity (Sec. 3.7.2), and here maximum likelihood methods are often to be preferred
since they require merely differential calculus techniques or gradient search for $\hat{\theta}$,
rather than a possibly complex multidimensional integration needed in Bayesian estimation.
This leads to another consideration: interpretability. In many cases the maximum
likelihood solution will be easier to interpret and understand since it returns the
single best model from the set the designer provided (and presumably understands). 
In contrast Bayesian methods give a weighted average of models (parameters), often 
leading to solutions more complicated and harder to understand than those provided 
by the designer. The Bayesian approach reflects the remaining uncertainty in the 
possible models. 

Another consideration is our confidence in the prior information, such as in the
form of the underlying distribution $p(x|\theta)$. A maximum likelihood solution $p(x|\hat{\theta})$
must of course be of the assumed parametric form; not so for the Bayesian solution.
We saw this difference in Example 1, where the Bayes solution was not of the parametric
form originally assumed, i.e., a uniform p(x|D). In general, through their use
of the full $p(\theta|D)$ distribution Bayesian methods use more of the information brought
to the problem than do maximum likelihood methods. (For instance, in Example 1
the addition of the third training point did not change the maximum likelihood solution,
but did refine the Bayesian estimate.) If such information is reliable, Bayes
methods can be expected to give better results. Further, general Bayesian methods
with a “flat” or uniform prior (i.e., where no prior information is explicitly imposed)
are more similar to maximum likelihood methods. If there is much data, leading to a
strongly peaked $p(\theta|D)$, and the prior $p(\theta)$ is uniform or flat, then the MAP estimate
is essentially the same as the maximum likelihood estimate.

When $p(\theta|D)$ is broad, or asymmetric around $\hat{\theta}$, the methods are quite likely to
yield p(x|D) distributions that differ from one another. Such a strong asymmetry
(when not due to rare statistical irregularities in the selection of the training data)
generally conveys some information about the distribution, just as did the asymmetric
role of the threshold $\theta$ in Example 1. Bayes methods would exploit such information;
not so maximum likelihood ones (at least not directly). Further, Bayesian methods
make more explicit the crucial problem of bias and variance tradeoffs: roughly
speaking, the balance between the accuracy of the estimation and its variance, which
depend upon the amount of training data. This important matter was irrelevant in
Chap. ??, where there was no notion of a finite training set, but it will be crucial in
our considerations of the theory of machine learning in Chap. ??.

When designing a classifier by either of these methods, we determine the posterior 
densities for each category, and classify a test point by the maximum posterior. (If 
there are costs, summarized in a cost matrix, these can be incorporated as well.)
There are three sources of classification error in our final system: 


Bayes or indistinguishability error: the error due to overlapping densities $p(x|\omega_i)$
for different values of i. This error is an inherent property of the problem and 
can never be eliminated. 


Model error: the error due to having an incorrect model. This error can only be 
eliminated if the designer specifies a model that includes the true model which 
generated the data. Designers generally choose the model based on knowledge 
of the problem domain rather than on the subsequent estimation method, and 
thus the model error in maximum likelihood and Bayes methods rarely differs. 


Estimation error: the error arising from the fact that the parameters are estimated 
from a finite sample. This error can best be reduced by increasing the training 
data, a topic we shall revisit in greater detail in Chap. ??. 


The relative contributions of these sources depend upon the problem, of course. In the 
limit of infinite training data, the estimation error vanishes, and the total classification 
error will be the same for both maximum likelihood and Bayes methods. 

In summary, there are strong theoretical and methodological arguments supporting 
Bayesian estimation, though in practice maximum likelihood estimation is simpler, 
and when used for designing classifiers, can lead to classifiers nearly as accurate. 


3.5.2 Non-informative Priors and Invariance 


Generally speaking, the information about the prior p(θ) derives from the designer’s 
knowledge of the problem domain and as such is beyond our study of the design of 
classifiers. Nevertheless in some cases we have guidance in how to create priors that 
do not impose structure when we believe none exists, and this leads us to the notion 
of non-informative priors. 

Recall our discussion of the role of prior category probabilities in Chap. ??, where 
in the absence of other information, we assumed each of c categories equally likely. 
Analogously, in a Bayesian framework we can have a “non-informative” prior over a 
parameter for a single category’s distribution. Suppose for instance that we are using 
Bayesian methods to infer from data the mean and variance of a Gaussian. What 
prior might we put on these parameters? Surely the unit of spatial measurement — 
meters, feet, inches — is an historical accident and irrelevant to the functional form 
of the prior. Thus there is an implied scale invariance, formally stated as 


$$p(\theta) = \alpha\, p(\theta/\alpha) \tag{55}$$


for some constant α. Such scale invariance here leads to priors such as p(μ) ∝ μ^(−k) 
for some undetermined constant k (Problem 20). (Such a prior is improper; it does 
not integrate to unity, and hence cannot strictly be interpreted as representing our 
actual prior belief.) In general, then, if there is known or assumed invariance — such 
as translation, or for discrete distributions invariance to the sequential order of data 
selection — there will be constraints on the form of the prior. If we can find a prior 
that satisfies such constraints, the resulting prior is “non-informative” with respect 
to that invariance. 

It is tempting to assert that the use of non-informative priors is somehow “objective” 
and lets the data speak for themselves, but such a view is a bit naive. For 
example, we may seek a non-informative prior when estimating the standard deviation 
σ of a Gaussian. But this requirement might not lead to the non-informative prior 
for estimating the variance, σ². Which should we use? In fact, the greatest benefit 
of this approach is that it forces the designer to acknowledge and be clear about the 
assumed invariance — the choice of which generally lies outside our methodology. It 
may be more difficult to accommodate such arbitrary transformations in a maximum 
a posteriori (MAP) estimator (Sec. 3.2.1), and hence considerations of invariance are 
of greatest use in Bayesian estimation, or when the posterior is very strongly peaked 
and the mode not influenced by transformations of the density (Problem 19). 


3.6 *Sufficient Statistics 


From a practical viewpoint, the formal solution provided by Eqs. 26, 51 & 52 is not 
computationally attractive. In pattern recognition applications it is not unusual to 
have dozens or hundreds of parameters and thousands of training samples, which 
makes the direct computation and tabulation of p(D|θ) or p(θ|D) quite out of the 
question. We shall see in Chap. ?? how neural network methods avoid many of the 
difficulties of setting such a large number of parameters in a classifier, but for now we 
note that the only hope for an analytic, computationally feasible maximum likelihood 
solution lies in being able to find a parametric form for p(x|θ) that on the one hand 
matches the characteristics of the problem and on the other hand allows a reasonably 
tractable solution. 

Consider the simplification that occurred in the problem of learning the parameters 
of a multivariate Gaussian density. The basic data processing required was merely 
the computation of the sample mean and sample covariance. This easily computed 
and easily updated statistic contained all the information in the samples relevant to 
estimating the unknown population mean and covariance. One might suspect that 
this simplicity is just one more happy property of the normal distribution, and that 
such good fortune is not likely to occur in other cases. While this is largely true, 
there are distributions for which computationally feasible solutions can be obtained, 
and the key to their simplicity lies in the notion of a sufficient statistic. 

To begin with, any function of the samples is a statistic. Roughly speaking, a 
sufficient statistic is a (possibly vector-valued) function s of the samples D that 
contains all of the information relevant to estimating some parameter θ. Intuitively, one 
might expect the definition of a sufficient statistic to involve the requirement that 
p(θ|s, D) = p(θ|s). However, this would require treating θ as a random variable, 
limiting the definition to a Bayesian domain. To avoid such a limitation, the 
conventional definition is as follows: A statistic s is said to be sufficient for θ if p(D|s, θ) is 
independent of θ. If we think of θ as a random variable, we can write 


$$p(\theta|s, D) = \frac{p(D|s, \theta)\, p(\theta|s)}{p(D|s)} \tag{56}$$

whereupon it becomes evident that p(θ|s, D) = p(θ|s) if s is sufficient for θ. 
Conversely, if s is a statistic for which p(θ|s, D) = p(θ|s), and if p(θ|s) ≠ 0, it is easy to 
show that p(D|s, θ) is independent of θ (Problem 27). Thus, the intuitive and the 
conventional definitions are basically equivalent. As one might expect, for a Gaussian 
distribution the sample mean and covariance, taken together, represent a sufficient 
statistic for the true mean and covariance; if these are known, all other statistics 
such as the mode, range, higher-order moments, number of data points, etc., are 
superfluous when estimating the true mean and covariance. 
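
The sufficiency of the sample mean and covariance in the Gaussian case can be illustrated numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the function names are hypothetical. It evaluates the Gaussian log-likelihood once from the raw samples and once from the pair (sample mean, sample covariance) alone, with n entering only as an overall scale factor.

```python
import numpy as np

def loglik_direct(X, mu, Sigma):
    """Gaussian log-likelihood of the rows of X, computed from every sample."""
    n, d = X.shape
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum("ki,ij,kj->", diff, Sigma_inv, diff)  # sum of quadratic forms
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad)

def loglik_from_stats(n, m, S, mu, Sigma):
    """Same log-likelihood, using only the sample mean m and the (1/n) sample covariance S."""
    d = m.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    dm = m - mu
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet
                       + np.trace(Sigma_inv @ S) + dm @ Sigma_inv @ dm)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic training samples
m = X.mean(axis=0)
S = (X - m).T @ (X - m) / X.shape[0]

A = rng.normal(size=(3, 3))
mu, Sigma = rng.normal(size=3), A @ A.T + np.eye(3)   # an arbitrary candidate parameter
print(loglik_direct(X, mu, Sigma), loglik_from_stats(X.shape[0], m, S, mu, Sigma))
```

The two values agree for any candidate (μ, Σ), so two training sets that share the same sample mean and covariance produce identical likelihood functions.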

A fundamental theorem concerning sufficient statistics is the Factorization 
Theorem, which states that s is sufficient for θ if and only if p(D|θ) can be factored into 
the product of two functions, one depending only on s and θ, and the other 
depending only on the training samples. The virtue of the Factorization Theorem is that it 
allows us to shift our attention from the rather complicated density p(D|s, θ), used 
to define a sufficient statistic, to the simpler function 


$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta). \tag{57}$$


In addition, the Factorization Theorem makes it clear that the characteristics of a 
sufficient statistic are completely determined by the density p(x|θ), and have nothing 
to do with a felicitous choice of an a priori density p(θ). A proof of the Factorization 
Theorem in the continuous case is somewhat tricky because degenerate situations are 
involved. Since the proof has some intrinsic interest, however, we include one for the 
simpler discrete case. 


Theorem 3.1 (Factorization) A statistic s is sufficient for θ if and only if the 
probability P(D|θ) can be written as the product 


$$P(D|\theta) = g(s, \theta)\, h(D), \tag{58}$$
for some function h(·). 


Proof: 


(a) We begin by showing the “only if” part of the theorem, namely that sufficiency 
implies the factorization. Suppose first that s is sufficient 
for θ, so that P(D|s, θ) is independent of θ. Since we want to show that P(D|θ) can 
be factored, our attention is directed toward computing P(D|θ) in terms of P(D|s, θ). 
We do this by summing the joint probability P(D, s|θ) over all values of s: 


$$P(D|\theta) = \sum_{s} P(D, s|\theta) = \sum_{s} P(D|s, \theta)\, P(s|\theta). \tag{59}$$


But since s = φ(D) for some φ(·), there is only one possible value for s for the given 
data, and thus 


$$P(D|\theta) = P(D|s, \theta)\, P(s|\theta). \tag{60}$$


Moreover, since by hypothesis P(D|s, θ) is independent of θ, the first factor depends 
only on D. Identifying P(s|θ) with g(s, θ), we see that P(D|θ) factors, as desired. 
(b) We now consider the “if” part of the theorem. To show that the ability to 
factor P(D|θ) as the product g(s, θ)h(D) implies that s is sufficient for θ, we must 
show that such a factoring implies that the conditional probability P(D|s, θ) is 
independent of θ. Because s = φ(D), specifying a value for s constrains the possible sets 
of samples to some set D̄. Formally, D̄ = {D | φ(D) = s}. If D̄ is empty, no assignment 
of values to the samples can yield that value of s, and P(s|θ) = 0. Excluding such 
cases, i.e., considering only values of s that can arise, we have 

$$P(D|s, \theta) = \frac{P(D, s|\theta)}{P(s|\theta)}. \tag{61}$$

The denominator can be computed by summing the numerator over all values of D. 
Since the numerator will be zero if D ∉ D̄, we can restrict the summation to D ∈ D̄. 
That is, 


$$P(D|s, \theta) = \frac{P(D, s|\theta)}{\sum_{D \in \bar{D}} P(D, s|\theta)} = \frac{P(D|\theta)}{\sum_{D \in \bar{D}} P(D|\theta)} = \frac{g(s, \theta)\, h(D)}{\sum_{D \in \bar{D}} g(s, \theta)\, h(D)} = \frac{h(D)}{\sum_{D \in \bar{D}} h(D)}, \tag{62}$$


which is independent of θ. Thus, by definition, s is sufficient for θ. ∎ 
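
The discrete argument of the proof can be checked by brute force in a small case. The sketch below is an illustration only, not part of the proof: it assumes n independent Bernoulli samples with s taken to be the number of ones, and it enumerates the set of all sample sets sharing that value of s. The function name is hypothetical.

```python
from itertools import product

def conditional_prob(D, theta):
    """P(D | s, theta) for Bernoulli samples D, with s = sum(D), by enumeration."""
    n, s = len(D), sum(D)

    def prob(seq):
        k = sum(seq)
        return theta ** k * (1.0 - theta) ** (len(seq) - k)   # P(seq | theta)

    # Sum P(D' | theta) over every sample set D' that yields the same statistic s.
    denom = sum(prob(seq) for seq in product((0, 1), repeat=n) if sum(seq) == s)
    return prob(D) / denom

D = (1, 0, 1, 1, 0)
for theta in (0.2, 0.5, 0.9):
    print(theta, conditional_prob(D, theta))
```

For every θ the result is 1/10, the reciprocal of the number of length-5 binary sequences containing three ones, so P(D|s, θ) does not depend on θ, as sufficiency requires.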


It should be pointed out that there are trivial ways of constructing sufficient 
statistics. For example we can define s to be a vector whose components are the 
n samples themselves: x_1, ..., x_n. In that case g(s, θ) = p(D|θ) and h(D) = 1. One 
can even produce a scalar sufficient statistic by the trick of interleaving the digits 
in the decimal expansion of the components of the n samples. Sufficient statistics 
such as these are of little interest, since they do not provide us with simpler results. 
The ability to factor p(D|θ) into a product g(s, θ)h(D) is interesting only when the 
function g and the sufficient statistic s are simple. It should be noted that sufficiency 
is an integral notion. That is, if s is a sufficient statistic for θ, this does not necessarily 
imply that their corresponding components are sufficient, i.e., that s_1 is sufficient for 
θ_1, or s_2 for θ_2, and so on (Problem 26). 

An obvious fact should also be mentioned: the factoring of p(D|θ) into g(s, θ)h(D) 
is not unique. If f(s) is any function of s, then g′(s, θ) = f(s)g(s, θ) and h′(D) = 
h(D)/f(s) are equivalent factors. This kind of ambiguity can be eliminated by defining 
the kernel density 


$$\bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'} \tag{63}$$

which is invariant to this kind of scaling. 

What is the importance of sufficient statistics and kernel densities for parameter 
estimation? The general answer is that the most practical applications of classical 
parameter estimation to pattern classification involve density functions that possess 
simple sufficient statistics and simple kernel densities. Moreover, it can be shown 
that for any classification rule, we can find another based solely on sufficient statistics 
that has equal or better performance. Thus — in principle at least — we need only 
consider decisions based on sufficient statistics. It is, in essence, the ultimate in data 
reduction: we can reduce an extremely large data set down to a few numbers — the 
sufficient statistics — confident that all relevant information has been preserved. This 
means, too, that we can always create the Bayes classifier from sufficient statistics, as 
for example our Bayes classifiers for Gaussian distributions were functions solely of 
the sufficient statistics, estimates of μ and Σ. 

In the case of maximum likelihood estimation, when searching for a value of θ 
that maximizes p(D|θ) = g(s, θ)h(D), we can restrict our attention to g(s, θ). In this 
case, the normalization provided by Eq. 63 is of no particular value unless ḡ(s, θ) is 
simpler than g(s, θ). The significance of the kernel density is revealed however in the 
Bayesian case. If we substitute p(D|θ) = g(s, θ)h(D) in Eq. 51, we obtain 


$$p(\theta|D) = \frac{g(s, \theta)\, p(\theta)}{\int g(s, \theta')\, p(\theta')\, d\theta'}. \tag{64}$$


If our prior knowledge of θ is very vague, p(θ) will tend to be uniform, or changing 
very slowly as a function of θ. For such an essentially uniform p(θ), Eq. 64 shows 
that p(θ|D) is approximately the same as the kernel density. Roughly speaking, the 
kernel density is the posterior distribution of the parameter vector when the prior 
distribution is uniform. Even when the a priori distribution is far from uniform, the 
kernel density typically gives the asymptotic distribution of the parameter vector. In 
particular, when p(x|θ) is identifiable and when the number of samples is large, g(s, θ) 
usually peaks sharply at some value θ = θ̂. If the a priori density p(θ) is continuous 
at θ = θ̂ and if p(θ̂) is not zero, p(θ|D) will approach the kernel density ḡ(s, θ). 
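
The asymptotic agreement between the posterior and the kernel density can be seen in a small numerical experiment. The sketch below is illustrative only and rests on assumptions not in the text: Bernoulli samples with sample mean s = 0.7, a deliberately non-uniform prior proportional to θ(1 − θ)⁴, and a grid approximation of the integrals in Eqs. 63 and 64. All names are hypothetical.

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 20001)   # grid over the parameter
dtheta = theta[1] - theta[0]

def normalize(log_f):
    """Turn unnormalized log-values on the grid into a density integrating to one."""
    f = np.exp(log_f - log_f.max())           # subtract the max to avoid underflow
    return f / (f.sum() * dtheta)

def kernel_and_posterior(n, s, log_prior):
    """Kernel density (Eq. 63) and posterior (Eq. 64) for n Bernoulli samples with mean s."""
    log_g = n * (s * np.log(theta) + (1 - s) * np.log(1 - theta))
    return normalize(log_g), normalize(log_g + log_prior)

def total_variation(n, s=0.7):
    """Half the integrated absolute difference between kernel density and posterior."""
    log_prior = np.log(theta) + 4 * np.log(1 - theta)   # unnormalized, far from uniform
    kernel, post = kernel_and_posterior(n, s, log_prior)
    return 0.5 * np.sum(np.abs(kernel - post)) * dtheta

for n in (10, 100, 10000):
    print(n, total_variation(n))
```

The distance between the two densities shrinks as n grows, even though the prior is far from uniform, because g(s, θ) becomes sharply peaked and the prior is nearly constant over the peak.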


3.6.1 Sufficient Statistics and the Exponential Family 


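To see how the Factorization Theorem can be used to obtain sufficient statistics, 
consider once again the familiar d-dimensional normal case with fixed covariance but 
unknown mean, i.e., p(x|θ) ~ N(θ, Σ). Here we have 

$$\begin{aligned}
p(D|\theta) &= \prod_{k=1}^{n} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (x_k - \theta)^t \Sigma^{-1} (x_k - \theta)\right] \\
&= \exp\left[-\frac{n}{2}\, \theta^t \Sigma^{-1} \theta + \theta^t \Sigma^{-1} \Big(\sum_{k=1}^{n} x_k\Big)\right] \times \frac{1}{(2\pi)^{nd/2} |\Sigma|^{n/2}} \exp\left[-\frac{1}{2} \sum_{k=1}^{n} x_k^t \Sigma^{-1} x_k\right].
\end{aligned} \tag{65}$$

This factoring isolates the θ dependence of p(D|θ) in the first term, and hence from 
the Factorization Theorem we conclude that the sum of the samples, Σ_{k=1}^{n} x_k, is sufficient for θ. Of course, 
any one-to-one function of this statistic is also sufficient for θ; in particular, the sample 
mean 


$$\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k \tag{66}$$

is also sufficient for θ. Using this statistic, we can write 


$$g(\hat{\mu}_n, \theta) = \exp\left[-\frac{n}{2} \left(\theta^t \Sigma^{-1} \theta - 2\, \theta^t \Sigma^{-1} \hat{\mu}_n\right)\right]. \tag{67}$$


From using Eq. 63, or by completing the square, we can obtain the kernel density: 


$$\bar{g}(\hat{\mu}_n, \theta) = \frac{1}{(2\pi)^{d/2} \left|\frac{1}{n}\Sigma\right|^{1/2}} \exp\left[-\frac{1}{2} (\theta - \hat{\mu}_n)^t \Big(\frac{1}{n}\Sigma\Big)^{-1} (\theta - \hat{\mu}_n)\right]. \tag{68}$$

These results make it immediately clear that μ̂_n is the maximum likelihood estimate 
for θ. The Bayesian posterior density can be obtained from ḡ(μ̂_n, θ) by performing 
the integration indicated in Eq. 64. If the a priori density is essentially uniform, 
p(θ|D) = ḡ(μ̂_n, θ). 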

This same general approach can be used to find sufficient statistics for other density 
functions. In particular, it applies to any member of the exponential family, a group 
of probability and probability density functions that possess simple sufficient statis- 
tics. Members of the exponential family include the Gaussian, exponential, Rayleigh, 
Poisson, and many other familiar distributions. They can all be written in the form 


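$$p(x|\theta) = \alpha(x) \exp\left[a(\theta) + b(\theta)^t c(x)\right]. \tag{69}$$

If we multiply n terms of the form in Eq. 69 we find 


$$p(D|\theta) = \exp\left[n\, a(\theta) + b(\theta)^t \sum_{k=1}^{n} c(x_k)\right] \prod_{k=1}^{n} \alpha(x_k) = g(s, \theta)\, h(D), \tag{70}$$


where we can take 


$$s = \frac{1}{n} \sum_{k=1}^{n} c(x_k), \qquad g(s, \theta) = \exp\left[n \{a(\theta) + b(\theta)^t s\}\right],$$


and 


$$h(D) = \prod_{k=1}^{n} \alpha(x_k).$$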
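
As a concrete instance of the exponential-family form of Eq. 69, consider the exponential density p(x|θ) = θ exp(−θx) for x ≥ 0, for which a(θ) = ln θ, b(θ) = −θ and c(x) = x, so that s is the sample mean and [g(s, θ)]^(1/n) = θ exp(−θs). The sketch below, which assumes NumPy and synthetic data and uses hypothetical function names, maximizes this function on a grid and compares the result with the closed-form maximizer 1/s.

```python
import numpy as np

def log_g_per_sample(theta, s):
    """(1/n) ln g(s, theta) for the exponential density: ln(theta) - theta * s."""
    return np.log(theta) - theta * s

def mle_on_grid(s, grid):
    """Grid point at which g(s, theta) is largest."""
    return grid[np.argmax(log_g_per_sample(grid, s))]

rng = np.random.default_rng(0)
true_theta = 2.5                                   # hypothetical value used to simulate data
x = rng.exponential(scale=1.0 / true_theta, size=5000)

s = x.mean()                                       # the sufficient statistic
grid = np.linspace(0.01, 10.0, 100000)
theta_grid = mle_on_grid(s, grid)
theta_closed = 1.0 / s                             # from d/dtheta [ln theta - theta s] = 0
print(theta_grid, theta_closed)
```

Only the single number s is needed once it has been computed; the 5000 raw samples play no further role in the estimate.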


The distributions, sufficient statistics, and unnormalized kernels for a number of 
commonly encountered members of the exponential family are given in Table ??. 
It is a fairly routine matter to derive maximum likelihood estimates and Bayesian 
a posteriori distributions from these solutions. With two exceptions, the solutions 
given are for univariate cases, though they can be used in multivariate situations if 
statistical independence can be assumed. Note that a few well-known probability 
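distributions, such as the Cauchy, do not have simple sufficient statistics, so that the sample 
mean can be a very poor estimator of the true mean (Problem 28). 

Table 3.1: Common Exponential Distributions and their Sufficient Statistics. 


Name | Distribution | Domain | s | [g(s, θ)]^(1/n) 
Normal | $p(x \mid \theta) = \sqrt{\theta_2 / (2\pi)}\; e^{-(1/2)\theta_2 (x - \theta_1)^2}$ | $\theta_2 > 0$ | $s_1 = \frac{1}{n}\sum_{k=1}^{n} x_k$, $s_2 = \frac{1}{n}\sum_{k=1}^{n} x_k^2$ | $\sqrt{\theta_2}\; e^{-\frac{1}{2}\theta_2 (s_2 - 2\theta_1 s_1 + \theta_1^2)}$ 
Multivariate Normal | $p(x \mid \theta) = \frac{\lvert\Theta_2\rvert^{1/2}}{(2\pi)^{d/2}}\; e^{-(1/2)(x - \theta_1)^t \Theta_2 (x - \theta_1)}$ | $\Theta_2$ positive definite | $s_1 = \frac{1}{n}\sum_{k=1}^{n} x_k$, $S_2 = \frac{1}{n}\sum_{k=1}^{n} x_k x_k^t$ | $\lvert\Theta_2\rvert^{1/2}\; e^{-\frac{1}{2}[\mathrm{tr}\,\Theta_2 S_2 - 2\theta_1^t \Theta_2 s_1 + \theta_1^t \Theta_2 \theta_1]}$ 
Exponential | $p(x \mid \theta) = \theta e^{-\theta x}$ for $x \ge 0$, 0 otherwise | $\theta > 0$ | $s = \frac{1}{n}\sum_{k=1}^{n} x_k$ | $\theta e^{-\theta s}$ 
Rayleigh | $p(x \mid \theta) = 2\theta x e^{-\theta x^2}$ for $x \ge 0$, 0 otherwise | $\theta > 0$ | $s = \frac{1}{n}\sum_{k=1}^{n} x_k^2$ | $\theta e^{-\theta s}$ 
Maxwell | $p(x \mid \theta) = \frac{4}{\sqrt{\pi}}\, \theta^{3/2} x^2 e^{-\theta x^2}$ for $x \ge 0$, 0 otherwise | $\theta > 0$ | $s = \frac{1}{n}\sum_{k=1}^{n} x_k^2$ | $\theta^{3/2} e^{-\theta s}$ 
[intervening rows illegible in the source] 
Bernoulli | $P(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$, $x = 0, 1$ | $0 < \theta < 1$ | $s = \frac{1}{n}\sum_{k=1}^{n} x_k$ | $\theta^s (1 - \theta)^{1 - s}$ 
Binomial | $P(x \mid \theta) = \frac{m!}{x!\,(m - x)!}\, \theta^x (1 - \theta)^{m - x}$, $x = 0, 1, \ldots, m$ | $0 < \theta < 1$ | $s = \frac{1}{n}\sum_{k=1}^{n} x_k$ | $\theta^s (1 - \theta)^{m - s}$ 


3.7 Problems of Dimensionality 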


In practical multicategory applications, it is not at all unusual to encounter problems 
involving fifty or a hundred features, particularly if the features are binary valued. 
We might typically believe that each feature is useful for at least some of the discrim- 
inations; while we may doubt that each feature provides independent information, 
intentionally superfluous features have not been included. There are two issues that 
must be confronted. The most important is how classification accuracy depends upon 
the dimensionality (and amount of training data); the second is the computational 
complexity of designing the classifier. 


3.7.1 Accuracy, Dimension, and Training Sample Size 


If the features are statistically independent, there are some theoretical results that 
suggest the possibility of excellent performance. For example, consider the two-class 
multivariate normal case with the same covariance where p(x|ω_j) ~ N(μ_j, Σ), j = 
1, 2. If the a priori probabilities are equal, then it is not hard to show (Chap. ??, 
Problem ??) that the Bayes error rate is given by 


$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \tag{71}$$


where r² is the squared Mahalanobis distance (Chap. ??, Sect. ??): 


$$r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1} (\mu_1 - \mu_2). \tag{72}$$


Thus, the probability of error decreases as r increases, approaching zero as r 
approaches infinity. In the conditionally independent case, Σ = diag(σ_1², ..., σ_d²), and 


$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2. \tag{73}$$

This shows how each feature contributes to reducing the probability of error. 
Naturally, the most useful features are the ones for which the difference between the 
means is large relative to the standard deviations. However no feature is useless if its 
means for the two classes differ. An obvious way to reduce the error rate further is to 
introduce new, independent features. Each new feature need not add much, but if r 
can be increased without limit, the probability of error can be made arbitrarily small. 
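
The effect of adding independent features on the Bayes error can be computed directly from Eqs. 71 and 73. The following sketch assumes equal priors and a shared diagonal covariance, as in the text; the particular means and the function name are hypothetical. It uses the identity that the integral in Eq. 71 equals erfc(r / (2√2)) / 2.

```python
import math

def bayes_error(mu1, mu2, sigma):
    """Bayes error for two equal-prior Gaussians with common diagonal covariance."""
    # Eq. 73: each independent feature adds ((mu_i1 - mu_i2) / sigma_i)^2 to r^2.
    r2 = sum(((a - b) / s) ** 2 for a, b, s in zip(mu1, mu2, sigma))
    r = math.sqrt(r2)
    # Eq. 71: upper-tail probability of a standard normal beyond r/2.
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

# Each feature separates the class means by only half a standard deviation.
for d in (1, 2, 5, 10, 50):
    print(d, bayes_error([0.5] * d, [0.0] * d, [1.0] * d))
```

Although no single feature is very informative here, the error falls steadily as features are added, because r grows as the square root of d.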
In general, if the performance obtained with a given set of features is inadequate, 
it is natural to consider adding new features, particularly ones that will help separate 
the class pairs most frequently confused. Although increasing the number of features 
increases the cost and complexity of both the feature extractor and the classifier, it 
is often reasonable to believe that the performance will improve. After all, if the 
probabilistic structure of the problem were completely known, the Bayes risk could 
not possibly be increased by adding new features. At worst, the Bayes classifier would 
ignore the new features, but if the new features provide any additional information, 
the performance must improve (Fig. 3.3). 
Unfortunately, it has frequently been observed in practice that, beyond a certain 
point, the inclusion of additional features leads to worse rather than better 
performance. This apparent paradox presents a genuine and serious problem for classifier 
design. The basic source of the difficulty can always be traced to the fact that we 
have the wrong model (e.g., the Gaussian assumption or the conditional independence 
assumption is wrong) or that the number of design or training samples is finite and thus the 
distributions are not estimated accurately. However, analysis of the problem is both 
challenging and subtle. Simple cases do not exhibit the experimentally observed 
phenomena, and more realistic cases are difficult to analyze. In an attempt to provide 
some rigor, we shall return to topics related to problems of dimensionality and sample 
size in Chap. ??. 


Figure 3.3: Two three-dimensional distributions have nonoverlapping densities, and 
thus in three dimensions the Bayes error vanishes. When projected to a subspace 
(here, the two-dimensional x_1–x_2 subspace or a one-dimensional x_1 subspace) there 
can be greater overlap of the projected distributions, and hence greater Bayes errors. 


3.7.2 Computational Complexity 


We have mentioned that one consideration affecting our design methodology is that of 
the computational difficulty, and here the technical notion of computational 
complexity can be useful. First, we will need to understand the notion of the order of a 
function f(x): we say that f(x) is “of the order of h(x)” (written f(x) = O(h(x)) 
and generally read “big oh of h(x)”) if there exist constants c_0 and x_0 such that 
|f(x)| ≤ c_0|h(x)| for all x > x_0. This means simply that for sufficiently large x, 
an upper bound on the function grows no worse than h(x). For instance, suppose 
f(x) = a_0 + a_1 x + a_2 x²; in that case we have f(x) = O(x²) because for sufficiently 
large x, the constant, linear and quadratic terms can be “overcome” by proper choice 
of c_0 and x_0. The generalization to functions of two or more variables is 
straightforward. It should be clear that by the definition above, the big oh order of a function is 
not unique. For instance, we can describe our particular f(x) as being O(x²), O(x³), 
O(x⁴), O(x² ln x). 

Because of the non-uniqueness of the big oh notation, we occasionally need to be 
more precise in describing the order of a function. We say that f(x) = Θ(h(x)) (“big 
theta of h(x)”) if there are constants x_0, c_1 and c_2 such that for x > x_0, f(x) always 
lies between c_1 h(x) and c_2 h(x). Thus our simple quadratic function above would obey 
f(x) = Θ(x²), but would not obey f(x) = Θ(x³). (A fuller explanation is provided 
in the Appendix.) 

In describing the computational complexity of an algorithm we are generally 
interested in the number of basic mathematical operations, such as additions, 
multiplications and divisions it requires, or in the time and memory needed on a computer. To 
illustrate this concept we consider the complexity of a maximum likelihood estimation 
of the parameters in a classifier for Gaussian densities in d dimensions, with n training 
samples for each of c categories. For each category it is necessary to calculate the 
discriminant function of Eq. 74, below. The computational complexity of finding the 
sample mean μ̂ is O(nd), since for each of the d dimensions we must add n component 
values. The required division by n in the mean calculation is a single computation, 
independent of the number of points, and hence does not affect this complexity. For 
each of the d(d + 1)/2 independent components of the sample covariance matrix Σ̂ 
there are n multiplications and additions (Eq. 19), giving a complexity of O(d²n). 
Once Σ̂ has been computed, its determinant is an O(d³) calculation, as we can easily 
verify by counting the number of operations in matrix “sweep” methods. The inverse 
can be calculated in O(d³) calculations, for instance by Gaussian elimination.* The 
complexity of estimating P(ω) is of course O(n). Equation 74 illustrates these 
individual components for the problem of setting the parameters of normal distributions 
via maximum likelihood: 


$$g(x) = -\frac{1}{2} \big(x - \overbrace{\hat{\mu}}^{O(dn)}\big)^t\; \overbrace{\hat{\Sigma}^{-1}}^{O(nd^2)}\; (x - \hat{\mu}) \;-\; \overbrace{\frac{d}{2} \ln 2\pi}^{O(1)} \;-\; \overbrace{\frac{1}{2} \ln |\hat{\Sigma}|}^{O(d^2 n)} \;+\; \overbrace{\ln P(\omega)}^{O(n)}. \tag{74}$$


Naturally we assume that n > d (otherwise our covariance matrix will not have a 
well defined inverse), and thus for large problems the overall complexity of calculating 
an individual discriminant function is dominated by the O(d²n) term in Eq. 74. This 
is done for each of the categories, and hence our overall computational complexity 
for learning in this Bayes classifier is O(cd²n). Since c is typically a constant much 
smaller than d² or n, we can call our complexity O(d²n). We saw in Sect. 3.7 that it 
was generally desirable to have more training data from a larger dimensional space; 
our complexity analysis shows the steep cost in so doing. 
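
The steps counted above can be made concrete in a short sketch. This is one possible arrangement, assuming NumPy and synthetic two-dimensional data; the function names are hypothetical and the comments give the order of each step. Learning is done once per category, while evaluating the discriminant of Eq. 74 for a test point costs only O(d²).

```python
import numpy as np

def fit_gaussian_ml(X, prior):
    """Maximum likelihood parameters of one category from its n-by-d sample matrix X."""
    n, d = X.shape
    mu = X.sum(axis=0) / n                      # O(nd): n additions per dimension
    diff = X - mu
    Sigma = diff.T @ diff / n                   # O(n d^2): dominates when n > d
    Sigma_inv = np.linalg.inv(Sigma)            # O(d^3)
    _, logdet = np.linalg.slogdet(Sigma)        # O(d^3)
    return mu, Sigma_inv, logdet, np.log(prior)

def discriminant(x, params):
    """Discriminant function of Eq. 74 for a single test point x."""
    mu, Sigma_inv, logdet, log_prior = params
    d = mu.shape[0]
    diff = x - mu                               # O(d)
    return (-0.5 * diff @ Sigma_inv @ diff      # O(d^2)
            - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + log_prior)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(200, 2))  # synthetic samples, category 1
X2 = rng.normal(loc=[4.0, 4.0], size=(200, 2))  # synthetic samples, category 2
p1, p2 = fit_gaussian_ml(X1, 0.5), fit_gaussian_ml(X2, 0.5)

x = np.array([0.2, -0.1])
print(discriminant(x, p1), discriminant(x, p2))
```

The test point lies near the first category's mean, so its first discriminant value is the larger of the two.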

We next reconsider the matter of estimating a covariance matrix in a bit more 
detail. This requires the estimation of d(d+1)/2 parameters — the d diagonal elements 
and d(d—1)/2 independent off-diagonal elements. We observe first that the appealing 
maximum likelihood estimate 


$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - m_n)(x_k - m_n)^t, \tag{75}$$

is an O(nd²) calculation, is the sum of n − 1 independent d-by-d matrices of rank one, 
and thus is guaranteed to be singular if n ≤ d. Since we must invert Σ̂ to obtain the 
discriminant functions, we have an algebraic requirement for at least d + 1 samples. 
To smooth out statistical fluctuations and obtain a really good estimate, it would not 
be surprising if several times that number of samples were needed. 
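
The algebraic requirement for at least d + 1 samples is easy to observe numerically. The sketch below assumes NumPy and synthetic Gaussian samples; the function names and the rank tolerance are hypothetical choices. It reports the rank of the maximum likelihood covariance estimate for several sample sizes in d = 5 dimensions.

```python
import numpy as np

def ml_covariance(X):
    """Maximum likelihood covariance estimate (Eq. 75) from the rows of X."""
    diff = X - X.mean(axis=0)
    return diff.T @ diff / X.shape[0]

def covariance_rank(n, d, seed=0):
    """Numerical rank of the estimate built from n random d-dimensional samples."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    return np.linalg.matrix_rank(ml_covariance(X), tol=1e-10)

d = 5
for n in (3, 5, 6, 50):
    print(n, covariance_rank(n, d))
```

For generic samples the rank is the smaller of n − 1 and d, so the estimate first becomes invertible at n = d + 1 = 6.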


* We mention for the aficionado that there are more complex matrix inversion algorithms that are 
O(d^2.376...), and there may be algorithms with even lower complexity yet to be discovered. 

The computational complexity for classification is less, of course. Given a test 
point x we must compute (x − μ̂), an O(d) calculation. Moreover, for each of the 
categories we must multiply the inverse covariance matrix by the separation vector, 
an O(d²) calculation. The max_i g_i(x) decision is a separate O(c) operation. For small 
c then, recall is an O(d²) operation. Here, as throughout virtually all pattern 
classification, recall is much simpler (and faster) than learning. The complexity of the 
corresponding case for Bayesian learning, summarized in Eq. 49, yields the same 
computational complexity as in maximum likelihood. More generally, however, Bayesian 
learning has higher complexity as a consequence of integrating over model parameters 
θ. 

Such a rough analysis did not tell us the constants of proportionality. For a finite 
size problem it is possible (though not particularly likely) that a particular O(n³) 
algorithm is simpler than a particular O(n²) algorithm, and it is occasionally necessary 
for us to determine these constants to find which of several implementations is the 
simplest. Nevertheless, big oh and big theta analyses, as just described, are generally 
the best way to describe the computational complexity of an algorithm. 

Sometimes we stress space and time complexities, which are particularly relevant 
when contemplating parallel implementations. For instance, the sample mean of a 
category could be calculated with d separate processors, each adding n sample values. 
Thus we can describe this implementation as O(d) in space (i.e., the amount of memory 
or possibly the number of processors) and O(n) in time (i.e., number of sequential 
steps). Of course for any particular algorithm there may be a number of time-space 
tradeoffs, for instance using a single processor many times, or using many processors 
in parallel for a shorter time. Such tradeoffs can be important considerations 
in neural network implementations, as we shall see in Chap. ??. 

A common qualitative distinction is made between polynomially complex and 
exponentially complex algorithms, the latter being O(a^k) for some constant a and aspect or variable k 
of the problem. Exponential algorithms are generally so complex that for reasonable 
size cases we avoid them altogether, and resign ourselves to approximate solutions 
that can be found by polynomially complex algorithms. 


3.7.3 Overfitting 


It frequently happens that the number of available samples is inadequate, and the 
question of how to proceed arises. One possibility is to reduce the dimensionality, 
either by redesigning the feature extractor, by selecting an appropriate subset of the 
existing features, or by combining the existing features in some way (Chap. ??). 
Another possibility is to assume that all c classes share the same covariance matrix, and 
to pool the available data. Yet another alternative is to look for a better estimate for 
Σ. If any reasonable a priori estimate Σ_0 is available, a Bayesian or pseudo-Bayesian 
estimate of the form λΣ_0 + (1 − λ)Σ̂ might be employed. If Σ_0 is diagonal, this 
diminishes the troublesome effects of “accidental” correlations. Alternatively, one can 
remove chance correlations heuristically by thresholding the sample covariance matrix. 
For example, one might assume that all covariances for which the magnitude of the 
correlation coefficient is not near unity are actually zero. An extreme of this approach 
is to assume statistical independence, thereby making all the off-diagonal elements be 
zero, regardless of empirical evidence to the contrary — an O(nd) calculation. Even 
though such assumptions are almost surely incorrect, the resulting heuristic estimates 
sometimes provide better performance than the maximum likelihood estimate of the 
full parameter space. 

Here we have another apparent paradox. The classifier that results from assuming 
independence is almost certainly suboptimal. It is understandable that it will perform 
better if it happens that the features actually are independent, but how can it provide 
better performance when this assumption is untrue? The answer again involves the 
problem of insufficient data, and some insight into its nature can be gained from 
considering an analogous problem in curve fitting. Figure 3.4 shows a set of ten data 
points and two candidate curves for fitting them. The data points were obtained 
by adding zero-mean, independent noise to a parabola. Thus, of all the possible 
polynomials, presumably a parabola would provide the best fit, assuming that we are 
interested in fitting data obtained in the future as well as the points at hand. Even 
a straight line could fit the training data fairly well. The parabola provides a better 
fit, but one might wonder whether the data are adequate to fix the curve. The best 
parabola for a larger data set might be quite different, and over the interval shown 
the straight line could easily be superior. The tenth-degree polynomial fits the given 
data perfectly. However, we do not expect that a tenth-degree polynomial is required 
here. In general, reliable interpolation or extrapolation can not be obtained unless 
the solution is overdetermined, i.e., there are more points than function parameters 
to be set. 


Figure 3.4: The “training data” (black dots) were selected from a quadratic function 
plus Gaussian noise, i.e., f(x) = ax² + bx + c + ε where p(ε) ~ N(0, σ²). The 10th 
degree polynomial shown fits the data perfectly, but we desire instead the second-order 
function f(x), since it would lead to better predictions for new samples. 


In fitting the points in Fig. 3.4, then, we might consider beginning with a high- 
order polynomial (e.g., 10th order), and successively smoothing or simplifying our 
model by eliminating the highest-order terms. While this would in virtually all cases 
lead to greater error on the “training data,” we might expect the generalization to 
improve. 
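
The curve-fitting analogy can be reproduced with a few lines of code. The sketch below assumes NumPy; the parabola's coefficients, the noise level and the function names are hypothetical. It fits polynomials of degree 1, 2 and 9 to ten noisy points (degree 9 being the lowest that can interpolate ten points exactly) and reports the mean squared error on the training points and on fresh points from the same source.

```python
import numpy as np

def true_f(x):
    return 1.0 - 2.0 * x + 3.0 * x ** 2      # hypothetical parabola; coefficients are arbitrary

def fit_errors(degree, seed=0, noise=0.3):
    """Training and test mean squared error of a least-squares polynomial fit."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, 10)
    y_train = true_f(x_train) + rng.normal(scale=noise, size=x_train.size)
    x_test = np.linspace(-1.0, 1.0, 200)
    y_test = true_f(x_test) + rng.normal(scale=noise, size=x_test.size)

    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 9):
    print(degree, fit_errors(degree))
```

The training error can only decrease as the degree rises, reaching essentially zero at degree 9, whereas the error on new points is typically smallest for the parabola.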

Analogously, there are a number of heuristic methods that can be applied in 
the Gaussian classifier case. For instance, suppose we wish to design a classifier 
for distributions N(μ_1, Σ_1) and N(μ_2, Σ_2) and we have reason to believe that we 
have insufficient data for accurately estimating the parameters. We might make the 
simplification that they have the same covariance, i.e., N(μ_1, Σ) and N(μ_2, Σ), and 
estimate Σ accordingly. Such estimation requires proper normalization of the data 
(Problem 36). 

An intermediate approach is to assume a weighted combination of the equal and 
individual covariances, a technique known as shrinkage (also called regularized 
discriminant analysis), since the individual covariances “shrink” toward a common one. 
If $i$ is an index on the $c$ categories in question, we have 

$$\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\, n_i \boldsymbol{\Sigma}_i + \alpha\, n \boldsymbol{\Sigma}}{(1-\alpha)\, n_i + \alpha\, n} \qquad (76)$$

for $0 < \alpha < 1$. Additionally, we could “shrink” the estimate of the (assumed) 
common covariance matrix toward the identity matrix, as 

$$\boldsymbol{\Sigma}(\beta) = (1-\beta)\boldsymbol{\Sigma} + \beta \mathbf{I}, \qquad (77)$$

for $0 < \beta < 1$ (Computer exercise 8). (Such methods for simplifying classifiers 
have counterparts in regression, generally known as ridge regression.) 
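
The two shrinkage formulas, Eqs. 76 and 77, can be sketched in a few lines. This is a 
minimal illustration, not a reference implementation: the function names and the 
numerical values are hypothetical, and matrices are stored as plain nested lists so 
that the sketch needs nothing beyond the standard library. 

```python
def shrink_toward_common(cov_i, n_i, cov_common, n, alpha):
    """Eq. 76: blend one category's covariance with the common covariance."""
    w_i = (1.0 - alpha) * n_i      # weight on the individual estimate
    w_c = alpha * n                # weight on the common estimate
    return [[(w_i * s_i + w_c * s_c) / (w_i + w_c)
             for s_i, s_c in zip(row_i, row_c)]
            for row_i, row_c in zip(cov_i, cov_common)]


def shrink_toward_identity(cov, beta):
    """Eq. 77: blend the common covariance with the identity matrix."""
    d = len(cov)
    return [[(1.0 - beta) * cov[r][c] + (beta if r == c else 0.0)
             for c in range(d)]
            for r in range(d)]


# Hypothetical numbers, chosen only to exercise the formulas.
sigma_1 = [[4.0, 1.0], [1.0, 2.0]]   # covariance estimated from category 1 alone
sigma = [[2.0, 0.0], [0.0, 2.0]]     # common (pooled) covariance
n_1, n = 10, 40                      # samples in category 1 and in total

sigma_1_shrunk = shrink_toward_common(sigma_1, n_1, sigma, n, alpha=0.5)
sigma_regularized = shrink_toward_identity(sigma, beta=0.25)
```

As $\alpha$ approaches 0 the individual estimate is returned essentially unchanged, 
and as $\alpha$ approaches 1 the common covariance takes over; the sample counts 
$n_i$ and $n$ set how quickly that happens in between. 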

Our short, intuitive discussion here will have to suffice until Chap. ??, where we 
will explore the crucial issue of controlling the complexity or expressive power of a 
classifier for optimum performance. 


3.8 *Expectation-Maximization (EM) 


We saw in Chap. ?? Sec. ?? how we could classify a test point even when it has 
missing features. We can now extend our application of maximum likelihood techniques 
to permit the learning of parameters governing a distribution from training points, 
some of which have missing features. If we had uncorrupted data, we could use 
maximum likelihood, i.e., find $\hat{\boldsymbol{\theta}}$ that maximized the 
log-likelihood $l(\boldsymbol{\theta})$. The basic idea in the expectation 
maximization or EM algorithm is to iteratively estimate the likelihood given the 
data that is present. The method has precursors in the Baum-Welch algorithm we will 
consider in Sec. 3.10.6. 

Consider a full sample $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of points 
taken from a single distribution. Suppose, though, that here some features are 
missing; thus any sample point can be written as 
$\mathbf{x}_k = \{\mathbf{x}_{kg}, \mathbf{x}_{kb}\}$, i.e., comprising the “good” 
features and the missing, or “bad” ones (Chapt. ??, Sect. ??). For notational 
convenience we separate these individual features (not samples) into two sets, 
$\mathcal{D}_g$ and $\mathcal{D}_b$, with $\mathcal{D} = \mathcal{D}_g \cup \mathcal{D}_b$ 
being the union of such features. 

Next we form the function 


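$$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i) = E_{\mathcal{D}_b}\left[\ln p(\mathcal{D}_g, \mathcal{D}_b; \boldsymbol{\theta}) \mid \mathcal{D}_g; \boldsymbol{\theta}^i\right], \qquad (78)$$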


where the use of the semicolon denotes, for instance on the left hand side, that 
$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i)$ is a function of $\boldsymbol{\theta}$ 
with $\boldsymbol{\theta}^i$ assumed fixed; on the right hand side it denotes that the 
expected value is over the missing features assuming $\boldsymbol{\theta}^i$ are the 
true parameters describing the (full) distribution. The simplest way to interpret 
this, the central equation in expectation maximization, is the following. The 
parameter vector $\boldsymbol{\theta}^i$ is the current (best) estimate for the full 
distribution; $\boldsymbol{\theta}$ is a candidate vector for an improved estimate. 
Given such a candidate $\boldsymbol{\theta}$, the right hand side of Eq. 78 calculates 
the likelihood of the data, including the unknown feature $\mathcal{D}_b$ marginalized 
with respect to the current best distribution, which is described by 
$\boldsymbol{\theta}^i$. Different candidate $\boldsymbol{\theta}$s will of course lead 
to different such likelihoods. Our algorithm will select the best such candidate 
$\boldsymbol{\theta}$, the one corresponding to the greatest 
$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i)$, and call it $\boldsymbol{\theta}^{i+1}$. 

If we continue to let $i$ be an iteration counter, and now let $T$ be a preset 
convergence criterion, our algorithm is as follows and illustrated in Fig. 3.5: 


Algorithm 1 (Expectation-Maximization) 


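1 begin initialize $\boldsymbol{\theta}^0$, $T$, $i = 0$ 
2     do $i \leftarrow i + 1$ 
3         E step: compute $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i)$ 
5         M step: $\boldsymbol{\theta}^{i+1} \leftarrow \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i)$ 
6     until $Q(\boldsymbol{\theta}^{i+1}; \boldsymbol{\theta}^i) - Q(\boldsymbol{\theta}^i; \boldsymbol{\theta}^{i-1}) \le T$ 
7     return $\hat{\boldsymbol{\theta}} \leftarrow \boldsymbol{\theta}^{i+1}$ 
8 end 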


Figure 3.5: The search for the best model via the EM algorithm starts with some 
initial value of the model parameters, $\boldsymbol{\theta}^0$. Then, via the M step 
the optimal $\boldsymbol{\theta}^1$ is found. Next, $\boldsymbol{\theta}^1$ is held 
constant and the value $\boldsymbol{\theta}^2$ found which optimizes $Q(\cdot,\cdot)$. 
This process iterates until no value of $\boldsymbol{\theta}$ can be found that will 
increase $Q(\cdot,\cdot)$. Note in particular that this is different from a gradient 
search. For example here $\boldsymbol{\theta}^1$ is the global optimum (given fixed 
$\boldsymbol{\theta}^0$), and would not necessarily have been found via gradient 
search. (In this illustration, $Q(\cdot,\cdot)$ is shown symmetric in its arguments; 
this need not be the case in general, however.) 


This so-called Expectation-Maximization or EM algorithm is most useful when the 
optimization of $Q(\cdot,\cdot)$ is simpler than that of $l(\cdot)$. Most importantly, 
the algorithm guarantees that the log-likelihood of the good data (with the bad data 
marginalized) will increase monotonically, as explored in Problem 37. This is not the 
same as finding the particular value of the bad data that gives the maximum likelihood 
of the full (completed) data, as can be seen in Example 2. 


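Example 2: Expectation-Maximization for a 2D normal model 


Suppose our data consists of four points in two dimensions, one point of which 
is missing a feature: 

$$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4\} = \left\{ \begin{pmatrix} 0 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \begin{pmatrix} * \\ 4 \end{pmatrix} \right\},$$

where $*$ represents the unknown value of the first feature of point $\mathbf{x}_4$. 
Thus our bad data $\mathcal{D}_b$ consists of the single feature $x_{41}$, and the 
good data $\mathcal{D}_g$ all the rest. We assume our model is a Gaussian with 
diagonal covariance and arbitrary mean, and thus can be described by the parameter 
vector 

$$\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^\top.$$

We take our initial guess to be a Gaussian centered on the origin having 
$\boldsymbol{\Sigma} = \mathbf{I}$, that is: 

$$\boldsymbol{\theta}^0 = (0, 0, 1, 1)^\top.$$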


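In finding our first improved estimate, $\boldsymbol{\theta}^1$, we must calculate 
$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0)$ or, by Eq. 78, 

$$\begin{aligned}
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0) &= E_{x_{41}}\left[\ln p(\mathbf{x}_g, \mathbf{x}_b; \boldsymbol{\theta}) \mid \boldsymbol{\theta}^0; \mathcal{D}_g\right] \\
&= \int_{-\infty}^{\infty} \left[\sum_{k=1}^{3} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \ln p(\mathbf{x}_4|\boldsymbol{\theta})\right] p(x_{41}|\boldsymbol{\theta}^0; x_{42} = 4)\, dx_{41} \\
&= \sum_{k=1}^{3} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \int_{-\infty}^{\infty} \ln p\left(\begin{pmatrix} x_{41} \\ 4 \end{pmatrix} \Big| \boldsymbol{\theta}\right) \frac{p\left(\begin{pmatrix} x_{41} \\ 4 \end{pmatrix} \Big| \boldsymbol{\theta}^0\right)}{\underbrace{\int_{-\infty}^{\infty} p\left(\begin{pmatrix} x'_{41} \\ 4 \end{pmatrix} \Big| \boldsymbol{\theta}^0\right) dx'_{41}}_{\equiv K}}\, dx_{41},
\end{aligned}$$

where $x_{41}$ is the unknown first feature of point $\mathbf{x}_4$, and $K$ is a 
constant that can be brought out of the integral. We focus on the integral, 
substitute the equation for a general Gaussian, and find 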


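$$\begin{aligned}
Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0) &= \sum_{k=1}^{3} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) + \frac{1}{K} \int_{-\infty}^{\infty} \ln p\left(\begin{pmatrix} x_{41} \\ 4 \end{pmatrix} \Big| \boldsymbol{\theta}\right) \frac{1}{2\pi} \exp\left[-\tfrac{1}{2}\left(x_{41}^2 + 4^2\right)\right] dx_{41} \\
&= \sum_{k=1}^{3} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) - \frac{1 + \mu_1^2}{2\sigma_1^2} - \frac{(4 - \mu_2)^2}{2\sigma_2^2} - \ln(2\pi\sigma_1\sigma_2).
\end{aligned}$$

This completes the expectation or E step. Through a straightforward calculation, 
we find the values of $\boldsymbol{\theta}$ (that is, $\mu_1$, $\mu_2$, $\sigma_1$ and 
$\sigma_2$) that maximize $Q(\cdot,\cdot)$, to get the next estimate: 

$$\boldsymbol{\theta}^1 = (0.75, 2.0, 0.938, 2.0)^\top.$$

This new mean and the 1/e ellipse of the new covariance matrix are shown in the figure. 
Subsequent iterations are conceptually the same, but require a bit more extensive 
calculation. The mean will remain at $\mu_2 = 2$. After three iterations the algorithm 
converges at the solution $\boldsymbol{\mu} = (1.0, 2.0)^\top$ and 
$\boldsymbol{\Sigma} = \mathrm{diag}(0.667, 2.0)$. 
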


The four data points, one of which is missing the value of its $x_1$ component, are 
shown in red. The initial estimate is a circularly symmetric Gaussian, centered on the 
origin (gray). (A better initial estimate could have been derived from the three 
known points.) Each iteration leads to an improved estimate, labelled by the iteration 
number $i$; here, after three iterations the algorithm has converged. 


We must be careful and note that the EM algorithm leads to the greatest 
log-likelihood of the good data, with the bad data marginalized. There may be 
particular values of the bad data that give a different solution and an even greater 
log-likelihood. For instance, in this Example if the missing feature had value 
$x_{41} = 2$, so that $\mathbf{x}_4 = (2, 4)^\top$, we would have a solution 

$$\boldsymbol{\theta} = (1.0, 2.0, 0.5, 2.0)^\top$$

and a log-likelihood for the full data (good plus bad) that is greater than for the good 
alone. Such an optimization, however, is not the goal of the canonical EM algorithm. 
Note too that if no data is missing, the calculation of 
$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^i)$ is simple since no integrals are involved. 
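
The iteration in Example 2 can be sketched numerically. The code below is a minimal 
illustration under stated assumptions, not the text's own procedure: it assumes the 
four data points are (0, 2), (1, 0), (2, 2) and (*, 4), marks the unknown value with 
`None`, and uses the fact that for a diagonal-covariance Gaussian the E step only 
needs the first and second moments of each missing value under the current estimate. 
The function and variable names are hypothetical. 

```python
def em_diagonal_gaussian(points, mu, var, n_iter):
    """Run n_iter EM iterations; None marks a missing feature value."""
    n, d = len(points), len(mu)
    mu, var = list(mu), list(var)
    for _ in range(n_iter):
        new_mu, new_var = [], []
        for j in range(d):
            first, second = 0.0, 0.0
            for x in points:
                if x[j] is None:
                    # E step: moments of the missing value under the current
                    # estimate (features are independent, so the observed
                    # feature of the same point does not change them).
                    first += mu[j]
                    second += var[j] + mu[j] ** 2
                else:
                    first += x[j]
                    second += x[j] ** 2
            # M step: closed-form maximizer of Q for a diagonal Gaussian.
            mean_j = first / n
            new_mu.append(mean_j)
            new_var.append(second / n - mean_j ** 2)
        mu, var = new_mu, new_var
    return mu, var


# ASSUMPTION: the four points of Example 2, with None for the unknown x_41.
POINTS = [(0.0, 2.0), (1.0, 0.0), (2.0, 2.0), (None, 4.0)]

mu_1, var_1 = em_diagonal_gaussian(POINTS, [0.0, 0.0], [1.0, 1.0], n_iter=1)
mu_inf, var_inf = em_diagonal_gaussian(POINTS, [0.0, 0.0], [1.0, 1.0], n_iter=60)
```

Under these assumptions the first iteration gives means (0.75, 2.0) and variances 
(0.9375, 2.0), and further iterations approach means (1, 2) and variances (2/3, 2). 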


Generalized Expectation-Maximization or GEM algorithms are a bit more lax than 
the EM algorithm, and require merely that an improved $\boldsymbol{\theta}^{i+1}$ be 
set in the M step (line 5) of the algorithm, not necessarily the optimal one. 
Naturally, convergence will not be as rapid as for a proper EM algorithm, but GEM 
algorithms afford greater freedom to choose computationally simpler steps. One version 
of GEM is to find the maximum likelihood value of unknown features at each iteration 
step, then recalculate $\boldsymbol{\theta}$ in light of these new values, if indeed 
they lead to a greater likelihood. 


In practice, the term Expectation-Maximization has come to mean loosely any 
iterative scheme in which the likelihood of some data increases with each step, even if 
such methods are not, technically speaking, the true EM algorithm as presented here. 


3.9 Bayesian Belief Networks 


The methods we have described up to now are fairly general: all that we assumed, 
at base, was that we could parameterize the distributions by a parameter vector 
$\boldsymbol{\theta}$. If we had prior information about the distribution of 
$\boldsymbol{\theta}$, this too could be used. Sometimes our knowledge about a 
distribution is not directly of this type, but instead about the statistical 
dependencies (or independencies) among the component features. Recall that for some 
multidimensional distribution $p(\mathbf{x})$, if for two features we have 
$p(x_i, x_j) = p(x_i)p(x_j)$, we say those variables are statistically independent 
(Fig. 3.6). 


Figure 3.6: A three-dimensional distribution which obeys 
$p(x_1, x_3) = p(x_1)p(x_3)$; thus here $x_1$ and $x_3$ are statistically independent 
but the other feature pairs are not. 


There are many cases where we know or can safely assume which variables are 
or are not independent, even without sampled data. Suppose for instance we are 
describing the state of an automobile — temperature of the engine, pressures of the 
fluids and in the tires, voltages in the wires, and so on. Our basic knowledge of cars 
includes the fact that the oil pressure in the engine and the air pressure in a tire are 
functionally unrelated, and hence can be safely assumed to be statistically indepen- 
dent. However the oil temperature and engine temperature are not independent (but 
could be conditionally independent). Furthermore we may know several variables that 
might influence another: the coolant temperature is affected by the engine tempera- 
ture, the speed of the radiator fan (which blows air over the coolant-filled radiator), 
and so on. 

We will represent these dependencies graphically, by means of Bayesian belief nets, 
also called causal networks, or simply belief nets. They take the topological form of a 
directed acyclic graph (DAG), where each link is directional, and there are no loops. 
(More general networks permit such loops, however.) While such nets can represent 
continuous multidimensional distributions, they have enjoyed greatest application and 
success for discrete variables. For this reason, and because the formal properties are 
simpler, we shall concentrate on the discrete case. 



Figure 3.7: A belief network consists of nodes (labelled with upper case bold letters) 
and their associated discrete states (in lower-case). Thus node A has states 
$a_1, a_2, \ldots$, denoted simply $\mathbf{a}$; node B has states $b_1, b_2, \ldots$, 
denoted $\mathbf{b}$, and so forth. The links between nodes represent conditional 
probabilities. For example, $P(\mathbf{c}|\mathbf{a})$ can be described by a matrix 
whose entries are $P(c_i|a_j)$. 


Each node (or unit) represents one of the system variables, and here takes on 
discrete values. We will label nodes with A, B, ..., and the variables at each node 
by the corresponding lower-case letter. Thus, while there are a discrete number of 
possible values of node A (here two, $a_1$ and $a_2$), there may be continuous-valued 
probabilities on these discrete states. For example, if node A represents the state of 
a binary lamp switch ($a_1$ = on, $a_2$ = off) we might have $P(a_1) = 0.739$, 
$P(a_2) = 0.261$, or indeed any other probabilities. A link joining node A to node C 
in Fig. 3.7 is directional, and represents the conditional probabilities $P(c_i|a_j)$, 
or simply $P(\mathbf{c}|\mathbf{a})$. For the time being we shall not be concerned with 
how these conditional probabilities are determined, except to note that in some cases 
human experts provide the values. 

Suppose we have a belief net, complete with conditional probabilities, and know 
the values or probabilities of some of the states. Through careful application of Bayes 
rule or Bayesian inference, we will be able to determine the maximum posterior value 
of the unknown variables in the net. We first consider how to determine the state 
of just one node from the states in units with which it is connected. The connected 
nodes are the only ones we need to consider directly — the others are conditionally 
independent. This is, at base, the simplification provided by our knowledge of the 
dependency structure of the system. 

In considering a single node X in the simple net of Fig. 3.8, it is extremely useful 
to distinguish the set of nodes before X — called its parents P — and the set of those 
after it — called its children C. When we evaluate the probabilities at X, we must 
treat the parents of X differently from its children. Thus, in Fig. 3.8, A and B are in 
P of X while C and D are in C. 

The belief of a set of propositions $\mathbf{x} = (x_1, x_2, \ldots)$ on node X 
describes the relative probabilities of the variables given all the evidence 
$\mathbf{e}$ throughout the rest of the network, i.e., $P(\mathbf{x}|\mathbf{e})$.* We 
can divide the dependency of the belief upon the parents and the children in the 
following way: 

$$P(\mathbf{x}|\mathbf{e}) \propto P(\mathbf{e}^{\mathcal{C}}|\mathbf{x})\, P(\mathbf{x}|\mathbf{e}^{\mathcal{P}}), \qquad (79)$$

where $\mathbf{e}$ represents all evidence (i.e., values of variables on nodes other 
than X), $\mathbf{e}^{\mathcal{P}}$ the evidence on the parent nodes, and 
$\mathbf{e}^{\mathcal{C}}$ the evidence on the children nodes. In Eq. 79 we show only 
a proportionality; at the end of our calculation we will normalize the probabilities 
over the states at X. 

* While this is sometimes denoted BEL(x), we keep a notation that clarifies the 
dependencies and is more similar to that in our previous discussions. 

Figure 3.8: A portion of a belief network, consisting of a node X, having variable 
values $(x_1, x_2, \ldots)$, its parents (A and B), and its children (C and D). 

The first term in Eq. 79 is quite simple, and is a manifestation of Bayes’ formula. 
We can expand the dependency upon the evidence of the children nodes as follows: 

$$\begin{aligned}
P(\mathbf{e}^{\mathcal{C}}|\mathbf{x}) &= P(\mathbf{e}_{C_1}, \mathbf{e}_{C_2}, \ldots, \mathbf{e}_{C_{|\mathcal{C}|}}|\mathbf{x}) \\
&= P(\mathbf{e}_{C_1}|\mathbf{x})\, P(\mathbf{e}_{C_2}|\mathbf{x}) \cdots P(\mathbf{e}_{C_{|\mathcal{C}|}}|\mathbf{x}) \\
&= \prod_{j=1}^{|\mathcal{C}|} P(\mathbf{e}_{C_j}|\mathbf{x}), \qquad (80)
\end{aligned}$$

where $C_j$ represents the $j$th child node and $\mathbf{e}_{C_j}$ the values of the 
probabilities of its states. Note too our convention that $|\mathcal{C}|$ denotes the 
cardinality of set $\mathcal{C}$ (the number of elements in the set), a convenient 
notation for indicating the full range of summations or products. In the last step of 
Eq. 80 we used our knowledge that since the child nodes cannot be joined by a line, 
they are conditionally independent given $\mathbf{x}$. Equation 80 simply states that 
the probability of a given set of states throughout all the children nodes of X is the 
product of the (independent) probabilities in the individual children nodes. For 
instance, in the simple example in Fig. 3.8, we have 

$$P(\mathbf{e}_C, \mathbf{e}_D|\mathbf{x}) = P(\mathbf{e}_C|\mathbf{x})\, P(\mathbf{e}_D|\mathbf{x}). \qquad (81)$$


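Incorporating evidence from parent nodes is a bit more subtle. We have: 

$$\begin{aligned}
P(\mathbf{x}|\mathbf{e}^{\mathcal{P}}) &= P(\mathbf{x}|\mathbf{e}_{P_1}, \mathbf{e}_{P_2}, \ldots, \mathbf{e}_{P_{|\mathcal{P}|}}) \\
&= \sum_{\text{all } i,j,\ldots,k} P(\mathbf{x}|P_{1i}, P_{2j}, \ldots, P_{|\mathcal{P}|k})\, P(P_{1i}, P_{2j}, \ldots, P_{|\mathcal{P}|k}|\mathbf{e}_{P_1}, \ldots, \mathbf{e}_{P_{|\mathcal{P}|}}) \\
&= \sum_{\text{all } i,j,\ldots,k} P(\mathbf{x}|P_{1i}, P_{2j}, \ldots, P_{|\mathcal{P}|k})\, P(P_{1i}|\mathbf{e}_{P_1}) \cdots P(P_{|\mathcal{P}|k}|\mathbf{e}_{P_{|\mathcal{P}|}}), \qquad (82)
\end{aligned}$$

where the summation is over all possible configurations of values on the different 
parent nodes. Here $P_{mn}$ denotes a particular value for state $n$ on parent node 
$P_m$. In the last step of Eq. 82 we have again used our assumption that the 
(unconnected) parent nodes are statistically independent. 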


While Eq. 82 and its unavoidable notational complexities may appear intimidating, 
it is actually just a logical consequence of Bayes’ rule. For the purposes of clarity and 
for computing $\mathbf{x}$, each term at the extreme right, 
$P(P_{1i}|\mathbf{e}_{P_1})$, can be considered to be $P(P_{1i})$, the probability of 
state $i$ on the first parent node. Our notation shows that this probability depends 
upon the evidence at $P_1$, including from its parents, but for the sake of computing 
the probabilities at X we temporarily ignore the dependencies beyond the parents and 
children of X. 


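Thus we rewrite Eq. 82 as 

$$P(\mathbf{x}|\mathbf{e}^{\mathcal{P}}) = \sum_{\text{all } P_{mn}} P(\mathbf{x}|P_{mn}) \prod_{i=1}^{|\mathcal{P}|} P(P_i|\mathbf{e}_{P_i}). \qquad (83)$$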


We put these results together for the general case with $|\mathcal{P}|$ parent nodes 
and $|\mathcal{C}|$ children nodes, Eqs. 80 & 83, and find 

$$P(\mathbf{x}|\mathbf{e}) \propto \underbrace{\left[\prod_{j=1}^{|\mathcal{C}|} P(\mathbf{e}_{C_j}|\mathbf{x})\right]}_{P(\mathbf{e}^{\mathcal{C}}|\mathbf{x})} \underbrace{\left[\sum_{\text{all } P_{mn}} P(\mathbf{x}|P_{mn}) \prod_{i=1}^{|\mathcal{P}|} P(P_i|\mathbf{e}_{P_i})\right]}_{P(\mathbf{x}|\mathbf{e}^{\mathcal{P}})}. \qquad (84)$$

In words, Eq. 84 states that the probability of a particular value for node X is 
the product of two factors. The first is due to the children (the product of their 
independent likelihoods). The second is the sum over all possible configurations of 
states on the parent nodes of the prior probabilities of their values and the conditional 
probabilities of the $\mathbf{x}$ variables given those parent values. The final values 
must be normalized to represent probabilities. 


Example 3: Belief network for fish 


Suppose we are again interested in classifying fish, but now we want to use more 
information. Imagine that a human expert has constructed the simple belief network 
in the figure, where node A represents the time of year, and can have four values: 
$a_1$ = winter, $a_2$ = spring, $a_3$ = summer and $a_4$ = autumn. Node B represents 
the geographical area where the fish was caught: $b_1$ = north Atlantic and $b_2$ = 
south Atlantic. A and B are the parents of the node X, which represents the fish and 
has just two possible values: $x_1$ = salmon and $x_2$ = sea bass. Similarly, our 
expert tells us that the children nodes represent lightness, C, with $c_1$ = light, 
$c_2$ = medium and $c_3$ = dark, as well as width, D, with $d_1$ = wide and $d_2$ = 
thin. The direction of the links (from A and B to X and likewise from X to C and D) is 
meant to describe the influences among the variables, as shown in the figure. 


A simple belief net for the fish example. The season and the fishing locale are 
statistically independent, but the type of fish caught does depend on these factors. 
Further, the width of the fish and its color depend upon the fish. 



The following probability matrices (here, given by an expert) describe the influence 
of time of year and fishing area on the identity of the fish: 

    P(x_i|a_j):      salmon   sea bass 
        winter        0.9      0.1 
        spring        0.3      0.7 
        summer        0.4      0.6 
        autumn        0.8      0.2 

    P(x_i|b_j):      salmon   sea bass 
        north         0.65     0.35 
        south         0.25     0.75 

Thus salmon are best found in the north fishing areas in the winter and autumn, 
sea bass in the south fishing areas in the spring and summer, and so forth. Recall 
that in our belief networks the variables are discrete, and all influences are cast as 
probabilities, rather than probability densities. Given that we have any particular 
feature value on a parent node, we must have some fish; thus each row is normalized, 
as for instance $P(x_1|a_1) + P(x_2|a_1) = 1$. 

Suppose our expert tells us that the conditional probabilities for the variables in 
the children nodes are as follows: 

    P(c_i|x_j):      light    medium   dark 
        salmon        0.33     0.33     0.34 
        sea bass      0.8      0.1      0.1 

    P(d_i|x_j):      wide     thin 
        salmon        0.4      0.6 
        sea bass      0.95     0.05 

Thus salmon come in the full range of lightnesses, while sea bass are primarily light 
in color and are primarily wide. 

Now we turn to the problem of using such a belief net to infer the identity 
of a fish. We have no direct information about the identity of the fish, and thus 
$P(x_1) = P(x_2) = 0.5$. This might be a reasonable starting point, expressing our lack 
of knowledge of the identity of the fish. Our goal now is to estimate the probabilities 
$P(x_1|\mathbf{e})$ and $P(x_2|\mathbf{e})$. Note that without any evidence we have 

$$\begin{aligned}
P(x_1) &= \sum_{i,j,k,l} P(x_1, a_i, b_j, c_k, d_l) \\
&= \sum_{i,j,k,l} P(a_i) P(b_j) P(x_1|a_i, b_j) P(c_k|x_1) P(d_l|x_1) \\
&= \sum_{i,j} P(a_i) P(b_j) P(x_1|a_i, b_j) \\
&= (0.25)(0.5) \sum_{i,j} P(x_1|a_i, b_j) \\
&= (0.25)(0.5)(0.9 + 0.3 + 0.4 + 0.7 + 0.8 + 0.2 + 0.1 + 0.6) \\
&= 0.5,
\end{aligned}$$

and thus $P(x_1) = P(x_2)$, as we would expect. 

Now we collect evidence for each node, 
$\{\mathbf{e}_A, \mathbf{e}_B, \mathbf{e}_C, \mathbf{e}_D\}$, assuming they are 
independent of each other. Suppose we know that it is winter, i.e., 
$P(a_1|\mathbf{e}_A) = 1$ and $P(a_i|\mathbf{e}_A) = 0$ for $i = 2, 3, 4$. Suppose we 
do not know which fishing area the boat came from but found that the particular 
fishing crew prefers to fish in the south Atlantic; we assume, then, that 
$P(b_1|\mathbf{e}_B) = 0.2$ and $P(b_2|\mathbf{e}_B) = 0.8$. We measure the fish and 
find that it is fairly light, and set by hand $P(\mathbf{e}_C|c_1) = 1$, 
$P(\mathbf{e}_C|c_2) = 0.5$, and $P(\mathbf{e}_C|c_3) = 0$. Suppose that due to 
occlusion, we cannot measure the width of the fish; we thus set 
$P(\mathbf{e}_D|d_1) = P(\mathbf{e}_D|d_2)$. 

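By Eq. 82, the estimated probability of each fish due to the parents $\mathcal{P}$ is, 
in full expanded form, 

$$\begin{aligned}
P_{\mathcal{P}}(x_1) \propto{}& P(x_1|a_1, b_1) P(a_1) P(b_1) + P(x_1|a_1, b_2) P(a_1) P(b_2) \\
&+ P(x_1|a_2, b_1) P(a_2) P(b_1) + P(x_1|a_2, b_2) P(a_2) P(b_2) \\
&+ \cdots + P(x_1|a_4, b_2) P(a_4) P(b_2) \\
={}& 0.82.
\end{aligned}$$

A similar calculation gives $P_{\mathcal{P}}(x_2) = 0.18$. 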
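We now turn to the children nodes and find by Eq. 84 

$$\begin{aligned}
P_{\mathcal{C}}(x_1) &\propto P(\mathbf{e}_C|x_1) P(\mathbf{e}_D|x_1) \\
&= [P(\mathbf{e}_C|c_1) P(c_1|x_1) + P(\mathbf{e}_C|c_2) P(c_2|x_1) + P(\mathbf{e}_C|c_3) P(c_3|x_1)] \\
&\quad \times [P(\mathbf{e}_D|d_1) P(d_1|x_1) + P(\mathbf{e}_D|d_2) P(d_2|x_1)] \\
&= [(1.0)(0.33) + (0.5)(0.33) + (0)(0.34)] \times [(1.0)(0.4) + (1.0)(0.6)] \\
&= 0.495.
\end{aligned}$$

A similar calculation gives $P_{\mathcal{C}}(x_2) \propto 0.85$. We put these estimates 
together by Eq. 79 as products $P(x_i) \propto P_{\mathcal{C}}(x_i) P_{\mathcal{P}}(x_i)$ 
and renormalize (i.e., divide by their sum). Thus our final estimates for node X are 

$$\begin{aligned}
P(x_1|\mathbf{e}) &= \frac{(0.82)(0.495)}{(0.82)(0.495) + (0.18)(0.85)} = 0.726 \\
P(x_2|\mathbf{e}) &= \frac{(0.18)(0.85)}{(0.82)(0.495) + (0.18)(0.85)} = 0.274.
\end{aligned}$$

Thus given all the evidence throughout the belief net, the most probable outcome is 
$x_1$ = salmon. 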
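
The calculation in Example 3 can be sketched as a short program. This is an 
illustrative sketch rather than a general belief-net engine. It reads the expert's 
tables as given above, and it makes one assumption that the example leaves implicit: 
the joint conditional $P(x|a, b)$ is taken proportional to the product 
$P(x|a)P(x|b)$, with normalization deferred to the end. All function and variable 
names are hypothetical. 

```python
# Conditional probability tables of Example 3; tuples are (salmon, sea bass).
P_X_GIVEN_A = {"winter": (0.9, 0.1), "spring": (0.3, 0.7),
               "summer": (0.4, 0.6), "autumn": (0.8, 0.2)}
P_X_GIVEN_B = {"north": (0.65, 0.35), "south": (0.25, 0.75)}
P_C_GIVEN_X = ((0.33, 0.33, 0.34), (0.8, 0.1, 0.1))   # light, medium, dark
P_D_GIVEN_X = ((0.4, 0.6), (0.95, 0.05))              # wide, thin


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def parent_term(p_a, p_b):
    """P(x|e^P): sum over all parent configurations, then normalize."""
    scores = [0.0, 0.0]
    for a, pa in p_a.items():
        for b, pb in p_b.items():
            for x in range(2):
                # ASSUMPTION: P(x|a,b) is proportional to P(x|a) P(x|b).
                scores[x] += P_X_GIVEN_A[a][x] * P_X_GIVEN_B[b][x] * pa * pb
    return normalize(scores)


def child_term(like_c, like_d):
    """P(e^C|x): product over children of sum_k P(e|state_k) P(state_k|x)."""
    out = []
    for x in range(2):
        from_c = sum(l * p for l, p in zip(like_c, P_C_GIVEN_X[x]))
        from_d = sum(l * p for l, p in zip(like_d, P_D_GIVEN_X[x]))
        out.append(from_c * from_d)
    return out


def belief(p_a, p_b, like_c, like_d):
    """Eq. 84: combine the two factors and normalize over the states of X."""
    from_parents = parent_term(p_a, p_b)
    from_children = child_term(like_c, like_d)
    return normalize([c * p for c, p in zip(from_children, from_parents)])


# Evidence of the example: winter for certain, crew prefers the south,
# fish fairly light, width unmeasured (equal likelihoods).
p_a = {"winter": 1.0, "spring": 0.0, "summer": 0.0, "autumn": 0.0}
p_b = {"north": 0.2, "south": 0.8}
posterior = belief(p_a, p_b, like_c=(1.0, 0.5, 0.0), like_d=(1.0, 1.0))
```

With unrounded intermediate values the sketch gives a salmon probability of about 
0.72; the 0.726 in the example comes from rounding the parent term to 0.82 before 
combining. 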


A given belief net can be used to infer any of the unknown variables. In Example 
3, we used information about the time of year, fishing location and some measured 
properties of the fish to infer its identity (salmon or sea bass). The same network could 
instead be used to infer the probability that a fish is thin, or dark in color, based on 
probabilities of the identity of the fish, time of year, and so on (Problem 42). 

When the dependency relationships among the features used by a classifier are 
unknown, we generally proceed by taking the simplest assumption, namely that the 
features are conditionally independent given the category, i.e., 

$$p(\omega_k|\mathbf{x}) \propto \prod_{i=1}^{d} p(x_i|\omega_k). \qquad (85)$$

This so-called naive Bayes rule or idiot Bayes rule often works quite well 
in practice, and can be expressed by a very simple belief net (Problem 43). 
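
Equation 85 amounts to multiplying per-feature likelihoods and normalizing. The sketch 
below is a minimal illustration with hypothetical names; the `priors` argument is an 
addition not present in Eq. 85 (equal priors reproduce the stated proportionality), 
and the numbers are illustrative values for a light, thin fish taken from the child 
tables of Example 3. 

```python
from math import prod


def naive_bayes_posterior(likelihoods, priors):
    """likelihoods[k][i] holds p(x_i | omega_k) for the observed features."""
    scores = [prior * prod(per_feature)
              for per_feature, prior in zip(likelihoods, priors)]
    total = sum(scores)
    return [score / total for score in scores]


# Illustrative values: p(light | fish) and p(thin | fish) for salmon, sea bass.
likelihoods = [[0.33, 0.6], [0.8, 0.05]]
posterior = naive_bayes_posterior(likelihoods, priors=[0.5, 0.5])
```

Because each category needs only $d$ one-dimensional likelihoods, the number of 
parameters to estimate grows linearly with the number of features rather than with 
the size of a full joint table. 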

In Example 3 our entire belief net consisted of X, its parents and children, and 
we needed to update only the values on X. In the more general case, where the 
network is large, there may be many nodes whose values are unknown. In that case 
we may have to visit nodes randomly and update the probabilities until the entire 
configuration of probabilities is stable. It can be shown that under weak conditions, 
this process will converge to consistent values of the variables throughout the entire 
network (Problem 44). 

Belief nets have found increasing use in complicated problems such as medical 
diagnosis. Here the upper-most nodes (ones without their own parents) represent a 
fundamental biological agent such as the presence of a virus or bacteria. Intermediate 
nodes then describe diseases, such as flu or emphysema, and the lower-most nodes 
the symptoms, such as high temperature or coughing. A physician enters measured 
values into the net and finds the most likely disease or cause. Such networks can be 
used in a somewhat more sophisticated way, automatically computing which unknown 
variable (node) should be measured to best reveal the identity of the disease. 

We will return in Chap. ?? to address the problem of learning in such belief net 
models. 


3.10 Hidden Markov Models 


While belief nets are a powerful method for representing the dependencies and 
independencies among variables, we turn now to the problem of representing a particular 
but extremely important class of dependencies. In problems that have an inherent 
temporality (that is, consist of a process that unfolds in time) we may have states 
at time $t$ that are influenced directly by a state at $t - 1$. Hidden Markov models 
(HMMs) have found greatest use in such problems, for instance speech recognition or 
gesture recognition. While the notation and description are unavoidably more 
complicated than the simpler models considered up to this point, we stress that the 
same underlying ideas are exploited. Hidden Markov models have a number of parameters, 
whose values are set so as to best explain training patterns for the known category. 
Later, a test pattern is classified by the model that has the highest posterior 
probability, i.e., that best “explains” the test pattern. 


3.10.1 First-order Markov models 


We consider a sequence of states at successive times; the state at any time $t$ is 
denoted $\omega(t)$. A particular sequence of length $T$ is denoted by 
$\boldsymbol{\omega}^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}$ as 
for instance we might have 
$\boldsymbol{\omega}^6 = \{\omega_1, \omega_4, \omega_2, \omega_2, \omega_1, \omega_4\}$. 
Note that the system can revisit a state at different steps, and not every state need 
be visited. 

Our model for the production of any sequence is described by transition 
probabilities $P(\omega_j(t+1)|\omega_i(t)) = a_{ij}$, the time-independent probability 
of having state $\omega_j$ at step $t + 1$ given that the state at time $t$ was 
$\omega_i$. There is no requirement that the transition probabilities be symmetric 
($a_{ij} \neq a_{ji}$, in general) and a particular state may be visited in succession 
($a_{ii} \neq 0$, in general), as illustrated in Fig. 3.9. 


Figure 3.9: The discrete states, $\omega_i$, in a basic Markov model are represented by 
nodes, and the transition probabilities, $a_{ij}$, by links. In a first-order discrete 
time Markov model, at any step $t$ the full system is in a particular state 
$\omega(t)$. The state at step $t + 1$ is a random function that depends solely on the 
state at step $t$ and the transition probabilities. 


Suppose we are given a particular model $\boldsymbol{\theta}$ (that is, the full set of 
$a_{ij}$) as well as a particular sequence $\boldsymbol{\omega}^T$. In order to 
calculate the probability that the model generated the particular sequence we simply 
multiply the successive probabilities. For instance, to find the probability that a 
particular model generated the sequence described above, we would have 
$P(\boldsymbol{\omega}^T|\boldsymbol{\theta}) = a_{14} a_{42} a_{22} a_{21} a_{14}$. If 
there is a prior probability on the first state $P(\omega(1) = \omega_i)$, we could 
include such a factor as well; for simplicity, we will ignore that detail for now. 
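
The product of successive transition probabilities can be sketched directly. The 
transition matrix below is hypothetical, chosen only so that each row sums to one; 
states $\omega_1, \ldots, \omega_4$ are stored as indices 0 to 3, and, as in the text, 
no prior on the first state is included. 

```python
def sequence_probability(a, states):
    """Multiply a[i][j] over consecutive pairs (i, j) of the state sequence."""
    p = 1.0
    for i, j in zip(states, states[1:]):
        p *= a[i][j]
    return p


# Hypothetical 4-state transition matrix; row i holds a_ij for all j.
A = [[0.10, 0.20, 0.30, 0.40],
     [0.50, 0.20, 0.20, 0.10],
     [0.25, 0.25, 0.25, 0.25],
     [0.30, 0.40, 0.20, 0.10]]

# The sequence omega_1, omega_4, omega_2, omega_2, omega_1, omega_4.
omega_6 = [0, 3, 1, 1, 0, 3]
p_sequence = sequence_probability(A, omega_6)   # a14 * a42 * a22 * a21 * a14
```

A sequence of $T$ states contributes $T - 1$ factors, which is why the six-state 
sequence above yields a product of five transition probabilities. 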

Up to here we have been discussing a Markov model, or technically speaking, a 
first-order discrete time Markov model, since the probability at $t + 1$ depends only 
on the states at $t$. For instance, a Markov model for the production of a spoken word 
might have states representing phonemes. Such a Markov model for the word “cat” would 
have states for /k/, /a/ and /t/, with transitions from /k/ to /a/; transitions from 
/a/ to /t/; and transitions from /t/ to a final silent state. 

Note however that in speech recognition the perceiver does not have access to the 
states $\omega(t)$. Instead, we measure some properties of the emitted sound. Thus we 
will have to augment our Markov model to allow for visible states, which are directly 
accessible to external measurement, as separate from the $\omega$ states, which are not. 


3.10.2 First-order hidden Markov models 


We continue to assume that at every time step $t$ the system is in a state 
$\omega(t)$ but now we also assume that it emits some (visible) symbol $v(t)$. While 
sophisticated Markov models allow for the emission of continuous functions (e.g., 
spectra), we will restrict ourselves to the case where a discrete symbol is emitted. 
As with the states, we define a particular sequence of such visible states as 
$\mathbf{V}^T = \{v(1), v(2), \ldots, v(T)\}$ and thus we might have 
$\mathbf{V}^6 = \{v_5, v_1, v_1, v_5, v_2, v_3\}$. 

Our model is then that in any state $\omega(t)$ we have a probability of emitting a 
particular visible state $v_k(t)$. We denote this probability 
$P(v_k(t)|\omega_j(t)) = b_{jk}$. Because we have access only to the visible states, 
while the $\omega_i$ are unobservable, such a full model is called a hidden Markov 
model (Fig. 3.10). 


Figure 3.10: Three hidden units in an HMM and the transitions between them are 
shown in black while the visible states and the emission probabilities of visible states 
are shown in red. This model shows all transitions as being possible; in other HMMs, 
some such candidate transitions are not allowed. 


3.10.3 Hidden Markov Model Computation 


Now we define some new terms and clarify our notation. In general, networks such as 
those in Fig. 3.10 are finite-state machines, and when they have associated transition 
probabilities, they are called Markov networks. They are strictly causal: the 
probabilities depend only upon previous states. A Markov model is called ergodic if 
every one of the states has a non-zero probability of occurring given some starting 
state. A final or absorbing state $\omega_0$ is one which, if entered, is never left 
(i.e., $a_{00} = 1$). 

As mentioned, we denote the transition probabilities among hidden states by $a_{ij}$ 
and the probability of the emission of a visible state by $b_{jk}$: 

$$\begin{aligned}
a_{ij} &= P(\omega_j(t+1)|\omega_i(t)) \\
b_{jk} &= P(v_k(t)|\omega_j(t)).
\end{aligned} \qquad (86)$$

We demand that some transition occur from step $t \to t + 1$ (even if it is to the 
same state), and that some visible symbol be emitted after every step. Thus we have 
the normalization conditions: 

$$\sum_j a_{ij} = 1 \ \text{for all } i \quad \text{and} \quad \sum_k b_{jk} = 1 \ \text{for all } j, \qquad (87)$$

where the limits on the summations are over all hidden states and all visible symbols, 
respectively. 
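
The definitions in Eqs. 86 and 87 can be illustrated by drawing a sequence from a 
small model. The sketch below assumes a hypothetical three-state, two-symbol HMM 
whose last state is absorbing, a known starting state, and the convention that the 
system first makes a transition and then emits a symbol from the new state; all names 
and numbers are illustrative. 

```python
import random


def rows_normalized(matrix, tol=1e-9):
    """Check Eq. 87: every row of the matrix sums to one."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix)


def sample_hmm(a, b, start, steps, rng):
    """Draw hidden states and visible symbols for a number of steps."""
    state = start
    hidden, visible = [], []
    for _ in range(steps):
        # Transition first (row `state` of a), then emit from the new state.
        state = rng.choices(range(len(a)), weights=a[state])[0]
        symbol = rng.choices(range(len(b[state])), weights=b[state])[0]
        hidden.append(state)
        visible.append(symbol)
    return hidden, visible


# Hypothetical model: A[i][j] = a_ij, B[j][k] = b_jk; state 2 is absorbing.
A = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.0, 0.0, 1.0]]
B = [[0.9, 0.1], [0.4, 0.6], [0.0, 1.0]]

assert rows_normalized(A) and rows_normalized(B)
hidden, visible = sample_hmm(A, B, start=0, steps=8, rng=random.Random(0))
```

An observer of such a process would see only `visible`; the list `hidden` is what the 
decoding problem attempts to recover. 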

With these preliminaries behind us, we can now focus on the three central issues 
in hidden Markov models: 


The Evaluation problem. Suppose we have an HMM, complete with transition 
probabilities $a_{ij}$ and $b_{jk}$. Determine the probability that a particular 
sequence of visible states $\mathbf{V}^T$ was generated by that model. 


The Decoding problem. Suppose we have an HMM as well as a set of observations 
$\mathbf{V}^T$. Determine the most likely sequence of hidden states 
$\boldsymbol{\omega}^T$ that led to those observations. 


The Learning problem. Suppose we are given the coarse structure of a model (the 
number of states and the number of visible states) but not the probabilities $a_{ij}$ 
and $b_{jk}$. Given a set of training observations of visible symbols, determine these 
parameters. 


We consider each of these problems in turn. 


3.10.4 Evaluation 
The probability that the model produces a sequence $\mathbf{V}^T$ of visible states is: 

$$P(\mathbf{V}^T) = \sum_{r=1}^{r_{max}} P(\mathbf{V}^T|\boldsymbol{\omega}_r^T)\, P(\boldsymbol{\omega}_r^T), \qquad (88)$$

where each $r$ indexes a particular sequence 
$\boldsymbol{\omega}_r^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}$ of $T$ hidden 
states. In the general case of $c$ hidden states, there will be $r_{max} = c^T$ possible 
terms in the sum of Eq. 88, corresponding to all possible sequences of length $T$. Thus, 
according to Eq. 88, in order to compute the probability that the model generated the 
particular sequence of $T$ visible states $\mathbf{V}^T$, we should take each 
conceivable sequence of hidden states, calculate the probability that it produces 
$\mathbf{V}^T$, and then add up these probabilities. The probability of a particular 
visible sequence is merely the product of the corresponding (hidden) transition 
probabilities $a_{ij}$ and the (visible) output probabilities $b_{jk}$ of each step. 

Because we are dealing here with a first-order Markov process, the second factor
in Eq. 88, which describes the transition probability for the hidden states, can be
rewritten as:

$$P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t) | \omega(t-1)), \tag{89}$$

that is, a product of the $a_{ij}$'s according to the hidden sequence in question. In
Eq. 89, $\omega(T) = \omega_0$ is some final absorbing state, which uniquely emits the visible state
$v_0$. In speech recognition applications, $\omega_0$ typically represents a null state or lack of
utterance, and $v_0$ is some symbol representing silence. Because of our assumption
that the output probabilities depend only upon the hidden state, we can write the
first factor in Eq. 88 as

$$P(V^T | \omega_r^T) = \prod_{t=1}^{T} P(v(t) | \omega(t)), \tag{90}$$

that is, a product of $b_{jk}$'s according to the hidden state and the corresponding visible
state. We can now use Eqs. 89 & 90 to express Eq. 88 as

$$P(V^T) = \sum_{r=1}^{r_{max}} \prod_{t=1}^{T} P(v(t) | \omega(t)) P(\omega(t) | \omega(t-1)). \tag{91}$$

Despite its formal complexity, Eq. 91 has a straightforward interpretation. The
probability that we observe the particular sequence of $T$ visible states $V^T$ is equal to
the sum over all $r_{max}$ possible sequences of hidden states of the conditional probability
that the system has made a particular transition multiplied by the probability that
it then emitted the visible symbol in our target sequence. All these are captured in
our parameters $a_{ij}$ and $b_{jk}$, and thus Eq. 91 can be evaluated directly. Alas, this is an
$O(c^T T)$ calculation, which is quite prohibitive in practice. For instance, if $c = 10$ and
$T = 20$, we must perform on the order of $10^{21}$ calculations.
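
The exhaustive sum of Eq. 91 can be sketched directly for a toy model. What follows is
only a minimal illustration, not an efficient method: the two-state model and its numbers,
the function name `brute_force_probability`, and the convention that the initial hidden
state is known are all assumptions made for the example, and the final absorbing state is
omitted for simplicity.

```python
from itertools import product

def brute_force_probability(a, b, visible, initial_state):
    """Sum P(V|path) * P(path) over every hidden path, as in Eq. 91."""
    c = len(a)                          # number of hidden states
    total = 0.0
    n_paths = 0
    for path in product(range(c), repeat=len(visible)):
        prob = 1.0
        prev = initial_state
        for state, symbol in zip(path, visible):
            # P(w(t) | w(t-1)) * P(v(t) | w(t))
            prob *= a[prev][state] * b[state][symbol]
            prev = state
        total += prob
        n_paths += 1
    return total, n_paths

# Hypothetical two-state model; every row sums to 1 (Eq. 87).
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
p, n_paths = brute_force_probability(A, B, visible=[0, 1, 1], initial_state=0)
print(p, n_paths)   # n_paths is c**T = 2**3 = 8
```

Each of the $c^T$ paths costs on the order of $T$ multiplications, which is where the
$O(c^T T)$ count comes from.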

A computationally simpler algorithm for the same goal is as follows. We can
calculate $P(V^T)$ recursively, since each term $P(v(t)|\omega(t)) P(\omega(t)|\omega(t-1))$ involves
only $v(t)$, $\omega(t)$ and $\omega(t-1)$. We do this by defining

$$\alpha_j(t) = \begin{cases} 0 & t = 0 \text{ and } j \neq \text{initial state} \\ 1 & t = 0 \text{ and } j = \text{initial state} \\ \left[ \sum_i \alpha_i(t-1) a_{ij} \right] b_{jk} v(t) & \text{otherwise,} \end{cases} \tag{92}$$

where the notation $b_{jk} v(t)$ means the output probability $b_{jk}$ selected by the visible
state emitted at time $t$. Thus the only non-zero contribution to the sum is for the
index $k$ which matches the visible state $v(t)$. Thus $\alpha_j(t)$ represents the probability
that our HMM is in hidden state $\omega_j$ at step $t$ having generated the first $t$ elements of
$V^T$. This calculation is implemented in the Forward algorithm in the following way:


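Algorithm 2 (HMM Forward)


1 initialize $\omega(1)$, $t \leftarrow 0$, $a_{ij}$, $b_{jk}$, visible sequence $V^T$, $\alpha(0) = 1$
2 for $t \leftarrow t + 1$
3     $\alpha_j(t) \leftarrow b_{jk} v(t) \sum_{i=1}^{c} \alpha_i(t-1) a_{ij}$
4 until $t = T$
5 return $P(V^T) \leftarrow \alpha_0(T)$
6 end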


where in line 5, $\alpha_0$ denotes the probability of the associated sequence ending in the
known final state. The Forward algorithm thus has a computational complexity of
$O(c^2 T)$, far more efficient than the complexity associated with exhaustive enumeration
of paths of Eq. 91 (Fig. 3.11). For the illustration of $c = 10$, $T = 20$ above, we
would need only on the order of 2000 calculations, more than 17 orders of magnitude
fewer than are needed to examine each path individually.
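
The recursion of Eq. 92 can be sketched in a few lines. This is a minimal sketch under
stated assumptions: the function name `forward`, the list-of-lists layout `a[i][j]` and
`b[j][k]`, and the two-state model are illustrative, and because this toy model has no
absorbing final state the total probability is taken as the sum of $\alpha_j(T)$ over $j$
rather than as $\alpha_0(T)$.

```python
def forward(a, b, visible, initial_state):
    """Return alpha[t][j] for t = 0..T, following Eq. 92."""
    c = len(a)
    alpha = [[1.0 if j == initial_state else 0.0 for j in range(c)]]
    for symbol in visible:                      # steps t = 1, ..., T
        prev = alpha[-1]
        alpha.append([
            # b_jk selected by v(t), times the sum over predecessor states i
            b[j][symbol] * sum(prev[i] * a[i][j] for i in range(c))
            for j in range(c)
        ])
    return alpha

# Hypothetical two-state model without an absorbing state.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
alpha = forward(A, B, visible=[0, 1, 1], initial_state=0)
# With no designated final state, P(V^T) is the sum of alpha_j(T) over j.
print(sum(alpha[-1]))
```

Each step costs $c^2$ multiply-adds (a sum over $i$ for every $j$), repeated $T$ times,
which is the $O(c^2 T)$ count quoted above.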

Figure 3.11: The computation of probabilities by the Forward algorithm can be
visualized by means of a trellis, a sort of “unfolding” of the HMM through time.
Suppose we seek the probability that the HMM was in state $\omega_2$ at $t = 3$ and generated
the observed visible sequence up through that step (including the observed visible symbol
$v_k$). The probability the HMM was in state $\omega_j(t = 2)$ and generated the observed
sequence through $t = 2$ is $\alpha_j(2)$ for $j = 1, 2, \ldots, c$. To find $\alpha_2(3)$ we must sum these,
each weighted by the transition probability $a_{j2}$, and multiply by the probability that
state $\omega_2$ emitted the observed symbol $v_k$. Formally, for this particular illustration we
have $\alpha_2(3) = b_{2k} \sum_{j=1}^{c} \alpha_j(2) a_{j2}$.


We shall have cause to use the Backward algorithm, which is the time-reversed
version of the Forward algorithm.


Algorithm 3 (HMM Backward)


1 initialize $\omega(T)$, $t \leftarrow T$, $a_{ij}$, $b_{jk}$, visible sequence $V^T$
2 for $t \leftarrow t - 1$
3     $\beta_i(t) \leftarrow \sum_{j=1}^{c} \beta_j(t+1) a_{ij} b_{jk} v(t+1)$
4 until $t = 0$
5 return $P(V^T) \leftarrow \beta_i(0)$ for the known initial state
6 end
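
The Backward recursion can be sketched in the same style. In this minimal sketch the
function name `backward` and the data layout are assumptions; the matrices are copied
from the worked evaluation example of this section (index 0 is the absorbing state and
the null symbol), and the visible symbols at steps $t = 1, \ldots, 4$ are assumed to be
$v_1, v_3, v_2, v_0$.

```python
def backward(a, b, visible, final_state):
    """Return beta[t][i] for t = 0..T, with beta_i(T) = 1 only for the final state."""
    c, T = len(a), len(visible)
    beta = [[0.0] * c for _ in range(T + 1)]
    beta[T][final_state] = 1.0
    for t in range(T - 1, -1, -1):              # t = T-1, ..., 0
        symbol = visible[t]                     # v(t+1), since visible[0] holds v(1)
        for i in range(c):
            beta[t][i] = sum(a[i][j] * b[j][symbol] * beta[t + 1][j]
                             for j in range(c))
    return beta

A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]
VISIBLE = [1, 3, 2, 0]                          # assumed symbols for t = 1..4

beta = backward(A, B, VISIBLE, final_state=0)
print(round(beta[0][1], 4))                     # beta for the known initial state w_1
```

Read at the known initial state, $\beta_1(0)$ is the same total probability that the
Forward algorithm delivers as $\alpha_0(T)$.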


Example 4: Hidden Markov Model


To clarify the evaluation problem, consider an HMM such as shown in Fig. 3.10,
but with an explicit absorber state and unique null visible symbol $v_0$ with the following
transition probabilities (where the matrix indexes begin at 0):

$$a_{ij} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.3 & 0.1 & 0.4 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.0 & 0.1 \end{pmatrix} \quad \text{and} \quad b_{jk} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.3 & 0.4 & 0.1 & 0.2 \\ 0 & 0.1 & 0.1 & 0.7 & 0.1 \\ 0 & 0.5 & 0.2 & 0.1 & 0.2 \end{pmatrix}.$$

What is the probability it generates the particular sequence $V^5 = \{v_3, v_1, v_3, v_2, v_0\}$?
Suppose we know the initial hidden state at $t = 0$ to be $\omega_1$. The visible symbol at
each step is shown above, and the $\alpha_i(t)$ in each unit. The circles show the value for
$\alpha_i(t)$ as we progress left to right. The product $a_{ij} b_{jk}$ is shown along each transition
link for the step $t = 1$ to $t = 2$. The final probability, $P(V^T|\theta)$, is hence 0.0011.

The HMM (above) consists of four hidden states (one of which is an absorber state,
$\omega_0$), each emitting one of five visible states; only the allowable transitions to visible
states are shown. The trellis for this HMM is shown below. In each node is $\alpha_i(t)$, the
probability the model generated the observed visible sequence up to $t$. For instance,
we know that the system was in hidden state $\omega_1$ at $t = 0$, and thus $\alpha_1(0) = 1$ and
$\alpha_i(0) = 0$ for $i \neq 1$. The arrows show the calculation of $\alpha_i(1)$. For instance, since
visible state $v_1$ was emitted at $t = 1$, we have $\alpha_0(1) = \alpha_1(0) a_{10} b_{01} = 1[0.2 \times 0] = 0$,
as shown by the top arrow. Likewise the next highest arrow corresponds to the
calculation $\alpha_1(1) = \alpha_1(0) a_{11} b_{11} = 1[0.3 \times 0.3] = 0.09$. In this example, the calculation
of $\alpha_i(1)$ is particularly simple, since only transitions from the known initial hidden
state need be considered; all other transitions have zero contribution to $\alpha_i(1)$. For
subsequent times, however, the calculation requires a sum over all hidden states at
the previous time, as given by line 3 in the Forward algorithm. The probability
shown in the final (absorbing) state gives the probability of the full sequence observed,
$P(V^T|\theta) = 0.0011$.
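
The numbers in this example can be checked with a short sketch of the Forward recursion.
The function name `forward_trellis` is an assumption, and so is the reading of the data:
following the worked arithmetic above (which uses $b_{11}$ at $t = 1$), the symbols
emitted at steps $t = 1, \ldots, 4$ are taken to be the last four of the printed sequence,
$v_1, v_3, v_2, v_0$, with $\omega_1$ known at $t = 0$.

```python
# a[i][j] and b[j][k] as printed in the example; index 0 is the absorber / null symbol.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]
VISIBLE = [1, 3, 2, 0]          # assumed symbols for t = 1..4

def forward_trellis(a, b, visible, initial_state):
    """Return the columns of the trellis, alpha[t][j] for t = 0..T."""
    c = len(a)
    alpha = [[1.0 if j == initial_state else 0.0 for j in range(c)]]
    for symbol in visible:
        prev = alpha[-1]
        alpha.append([b[j][symbol] * sum(prev[i] * a[i][j] for i in range(c))
                      for j in range(c)])
    return alpha

trellis = forward_trellis(A, B, VISIBLE, initial_state=1)
for t, column in enumerate(trellis):
    print(t, [round(x, 4) for x in column])
print("P(V|theta) =", round(trellis[-1][0], 4))   # alpha_0(T), about 0.0011
```

The column printed for $t = 1$ contains the values 0 and 0.09 derived by hand above, and
the entry for the absorbing state in the last column rounds to the quoted 0.0011.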


If we denote our model (the $a$'s and $b$'s) by $\theta$, we have by Bayes' formula that
the probability of the model given the observed sequence is:

$$P(\theta | V^T) = \frac{P(V^T | \theta) P(\theta)}{P(V^T)}. \tag{93}$$

In HMM pattern recognition we would have a number of HMMs, one for each category,
and classify a test sequence according to the model with the highest probability. Thus
in HMM speech recognition we could have a model for “cat” and another one for
“dog” and for a test utterance determine which model has the highest probability. In
practice, nearly all HMMs for speech are left-to-right models (Fig. 3.12).


Figure 3.12: A left-to-right HMM commonly used in speech recognition. For instance,
such a model could describe the utterance “viterbi,” where $\omega_1$ represents the phoneme
/v/, $\omega_2$ represents /i/, ..., and $\omega_0$ a final silent state. Such a left-to-right model is
more restrictive than the general HMM in Fig. 3.10, and precludes transitions “back”
in time.


The Forward algorithm gives us $P(V^T|\theta)$. The prior probability of the model,
$P(\theta)$, is given by some external source, such as a language model in the case of speech.
This prior probability might depend upon the semantic context, or the previous words,
or yet other information. In the absence of such information, it is traditional to assume
a uniform density on $P(\theta)$, and hence ignore it in any classification problem. (This
is an example of a “non-informative” prior.)
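
Classification by Eq. 93 can be sketched as follows. Everything specific here is
hypothetical: the two tiny models labeled "cat" and "dog", their numbers, the priors,
and the function names. The denominator $P(V^T)$ is common to all models and is therefore
dropped from the comparison.

```python
def sequence_likelihood(model, visible):
    """P(V^T | theta) by the forward recursion, summed over end states."""
    a, b, start = model["a"], model["b"], model["start"]
    c = len(a)
    alpha = [1.0 if j == start else 0.0 for j in range(c)]
    for symbol in visible:
        alpha = [b[j][symbol] * sum(alpha[i] * a[i][j] for i in range(c))
                 for j in range(c)]
    return sum(alpha)

def classify(models, priors, visible):
    """Return the label maximizing P(V|theta) * P(theta), plus all the scores."""
    scores = {name: sequence_likelihood(m, visible) * priors[name]
              for name, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical two-state models over two visible symbols.
MODELS = {
    "cat": {"a": [[0.8, 0.2], [0.3, 0.7]], "b": [[0.9, 0.1], [0.5, 0.5]], "start": 0},
    "dog": {"a": [[0.5, 0.5], [0.5, 0.5]], "b": [[0.1, 0.9], [0.2, 0.8]], "start": 0},
}
PRIORS = {"cat": 0.5, "dog": 0.5}   # uniform prior, so it does not affect the choice

label, scores = classify(MODELS, PRIORS, visible=[0, 0, 0])
print(label, scores)
```

With a uniform prior the decision reduces to comparing the likelihoods alone; a
language model would enter only through the `PRIORS` values.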


3.10.5 Decoding 


Given a sequence of visible states $V^T$, the decoding problem is to find the most
probable sequence of hidden states. While we might consider enumerating every
possible path and calculating the probability of the visible sequence observed, this is
an $O(c^T T)$ calculation and prohibitive. Instead, we use perhaps the simplest decoding
algorithm:


Algorithm 4 (HMM decoding)


1 begin initialize Path $\leftarrow \{\}$, $t \leftarrow 0$
2     for $t \leftarrow t + 1$
3         $j \leftarrow 0$
4         for $j \leftarrow j + 1$
5             $\alpha_j(t) \leftarrow b_{jk} v(t) \sum_{i=1}^{c} \alpha_i(t-1) a_{ij}$
6         until $j = c$
7         $j' \leftarrow \arg\max_j \alpha_j(t)$
8         AppendTo Path $\omega_{j'}$
9     until $t = T$
10 return Path
11 end


A closely related algorithm uses logarithms of the probabilities and calculates total
probabilities by addition of such logarithms; this method has complexity $O(c^2 T)$
(Problem 48).
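
The text does not spell out the logarithmic variant. The sketch below is the standard
Viterbi recursion, which we assume is the method alluded to; the function names are
illustrative. The matrices are those of the worked evaluation example in this section,
and the visible symbols for $t = 1, \ldots, 4$ are assumed to be $v_1, v_3, v_2, v_0$.

```python
import math

def safe_log(p):
    """log(p), with log(0) represented as minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(a, b, visible, initial_state):
    """Most probable hidden path, scored by sums of log probabilities."""
    c = len(a)
    score = [0.0 if j == initial_state else float("-inf") for j in range(c)]
    back = []                                   # back[t-1][j]: best predecessor of j at step t
    for symbol in visible:
        new_score, pointers = [], []
        for j in range(c):
            best_i = max(range(c), key=lambda i: score[i] + safe_log(a[i][j]))
            pointers.append(best_i)
            new_score.append(score[best_i] + safe_log(a[best_i][j])
                             + safe_log(b[j][symbol]))
        score = new_score
        back.append(pointers)
    state = max(range(c), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):             # trace the pointers back to t = 0
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path                                 # path[t] is the state at step t

A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]
print(viterbi(A, B, visible=[1, 3, 2, 0], initial_state=1))
```

Because the maximization runs over whole paths rather than step by step, the returned
path cannot contain a transition of probability zero as long as at least one path has
nonzero probability.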


Figure 3.13: The decoding algorithm finds at each time step $t$ the state that has the
highest probability of having come from the previous step and generated the observed
visible state $v_k$. The full path is the sequence of such states. Because this is a
local optimization (dependent only upon the single previous time step, not the full
sequence), the algorithm does not guarantee that the path is indeed allowable. For
instance, it might be possible that the maximum at $t = 5$ is $\omega_1$ and at $t = 6$ is $\omega_2$, and
thus these would appear in the path. This can even occur if $a_{12} = P(\omega_2(t+1)|\omega_1(t)) = 0$,
precluding that transition.


The red line in Fig. 3.13 corresponds to Path, and connects the hidden states with
the highest value of $\alpha_j$ at each step $t$. There is a difficulty, however. Note that there
is no guarantee that the path is in fact a valid one; it might not be consistent with
the underlying models. For instance, it is possible that the path actually implies a
transition that is forbidden by the model, as illustrated in Example 5.


Example 5: HMM decoding


We find the path for the data of Example 4 to be the sequence $\{\omega_1, \omega_3, \omega_2, \omega_1, \omega_0\}$.
Note especially that the transition from $\omega_3$ to $\omega_2$ is not allowed according to the
transition probabilities $a_{ij}$ given in Example 4. The path locally optimizes the probability
through the trellis.

The locally optimal path through the HMM trellis of Example 4.
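
The locally optimal path can be reproduced with a short sketch of the step-by-step
decoder. The function name `greedy_decode` is an assumption, the matrices are those of
Example 4, and the visible symbols for $t = 1, \ldots, 4$ are assumed to be
$v_1, v_3, v_2, v_0$ with $\omega_1$ known at $t = 0$.

```python
def greedy_decode(a, b, visible, initial_state):
    """At each step keep the state with the largest alpha_j(t)."""
    c = len(a)
    alpha = [1.0 if j == initial_state else 0.0 for j in range(c)]
    path = [initial_state]
    for symbol in visible:
        alpha = [b[j][symbol] * sum(alpha[i] * a[i][j] for i in range(c))
                 for j in range(c)]
        path.append(max(range(c), key=lambda j: alpha[j]))
    return path

A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]

path = greedy_decode(A, B, visible=[1, 3, 2, 0], initial_state=1)
# Collect consecutive pairs in the path whose transition probability is zero.
forbidden = [(i, j) for i, j in zip(path, path[1:]) if A[i][j] == 0.0]
print(path, forbidden)
```

The list `forbidden` makes the difficulty concrete: it holds every step of the decoded
path that the transition matrix itself rules out.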


HMMs address the problem of rate invariance in the following two ways. The first
is that the transition probabilities themselves incorporate probabilistic structure of
the durations. The second is that, using post-processing, we can delete repeated states
and obtain a sequence that is somewhat independent of variations in rate. Thus in
post-processing we can convert the sequence $\{\omega_1, \omega_1, \omega_3, \omega_2, \omega_2, \omega_2\}$ to $\{\omega_1, \omega_3, \omega_2\}$,
which would be appropriate for speech recognition, where the fundamental phonetic units
are not repeated in natural speech.
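
The post-processing step amounts to collapsing runs of identical states. A minimal
sketch, in which the function name `collapse_repeats` and the use of integer state
indices are assumptions:

```python
from itertools import groupby

def collapse_repeats(states):
    """Delete consecutive repeats, keeping one representative per run."""
    return [state for state, _run in groupby(states)]

print(collapse_repeats([1, 1, 3, 2, 2, 2]))   # the sequence from the text
```

Only adjacent duplicates are merged, so a state that recurs later in the sequence
(after some other state) is kept as a separate entry.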


3.10.6 Learning 


The goal in HMM learning is to determine model parameters, the transition
probabilities $a_{ij}$ and $b_{jk}$, from an ensemble of training samples. There is no known
method for obtaining the optimal or most likely set of parameters from the data, but
we can nearly always determine a good solution by a straightforward technique.


The Forward-backward Algorithm 


The Forward-backward algorithm is an instance of a generalized Expectation-Maximization
algorithm. The general approach will be to iteratively update the weights in order to
better explain the observed training sequences.

Above, we defined $\alpha_i(t)$ as the probability that the model is in state $\omega_i(t)$ and has
generated the target sequence up to step $t$. We can analogously define $\beta_i(t)$ to be
the probability that the model is in state $\omega_i(t)$ and will generate the remainder of the
given target sequence, i.e., from $t + 1 \to T$. We express $\beta_i(t)$ as:

$$\beta_i(t) = \begin{cases} 0 & \omega_i(t) \neq \text{sequence's final state and } t = T \\ 1 & \omega_i(t) = \text{sequence's final state and } t = T \\ \sum_j a_{ij} b_{jk} v(t+1) \beta_j(t+1) & \text{otherwise.} \end{cases} \tag{94}$$

To understand Eq. 94, imagine we knew $\alpha_i(t)$ up to step $T - 1$, and we wanted to
calculate the probability that the model would generate the remaining single visible
symbol. This probability, $\beta_i(T - 1)$, is just the probability we make a transition to state
$\omega_j(T)$ multiplied by the probability that this hidden state emitted the correct final
visible symbol, weighted by $\beta_j(T)$ and summed over $j$. By the definition of $\beta_j(T)$ in
Eq. 94, this weight will be either 0 (if $\omega_j(T)$ is not the final hidden state) or 1 (if it is).
Thus it is clear that $\beta_i(T - 1) = \sum_j a_{ij} b_{jk} v(T) \beta_j(T)$.
Now that we have determined $\beta_i(T - 1)$, we can repeat the process, to determine
$\beta_i(T - 2)$, and so on, backward through the trellis of Fig. ??.

But the $\alpha_i(t)$ and $\beta_i(t)$ we determined are merely estimates of their true values,
since we don't know the actual value of the transition probabilities $a_{ij}$ and $b_{jk}$ in
Eq. 94. We can calculate an improved value by first defining $\gamma_{ij}(t)$, the probability
of transition between $\omega_i(t-1)$ and $\omega_j(t)$, given the model generated the entire training
sequence $V^T$ by any path. We do this by defining $\gamma_{ij}(t)$ as follows:

$$\gamma_{ij}(t) = \frac{\alpha_i(t-1) a_{ij} b_{jk} v(t) \beta_j(t)}{P(V^T|\theta)}, \tag{95}$$

where $P(V^T|\theta)$ is the probability that the model generated sequence $V^T$ by any path.
Thus $\gamma_{ij}(t)$ is the probability of a transition from state $\omega_i(t-1)$ to $\omega_j(t)$ given that
the model generated the complete visible sequence $V^T$.

We can now calculate an improved estimate for $a_{ij}$. The expected number of
transitions between state $\omega_i(t-1)$ and $\omega_j(t)$ at any time in the sequence is simply
$\sum_{t=1}^{T} \gamma_{ij}(t)$, whereas the total expected number of any transitions from $\omega_i$ is
$\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)$. Thus $\hat{a}_{ij}$ (the estimate of the
probability of a transition from $\omega_i(t-1)$ to $\omega_j(t)$) can be found by taking the ratio
between the expected number of transitions from $\omega_i$ to $\omega_j$ and the total expected
number of any transitions from $\omega_i$. That is:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)}. \tag{96}$$

In the same way, we can obtain an improved estimate $\hat{b}_{jk}$ by calculating the ratio
between the frequency that any particular symbol $v_k$ is emitted by $\omega_j$ and that for any
symbol. Since in Eq. 95 the symbol at step $t$ is emitted by the destination state $\omega_j(t)$,
we sum $\gamma_{ij}(t)$ over the source states $i$. Thus we have

$$\hat{b}_{jk} = \frac{\sum_{t=1,\, v(t)=v_k}^{T} \sum_i \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_i \gamma_{ij}(t)}. \tag{97}$$

In short, then, we start with rough or arbitrary estimates of $a_{ij}$ and $b_{jk}$, calculate
improved estimates by Eqs. 96 & 97, and repeat until some convergence criterion
is met (e.g., sufficiently small change in the estimated values of the parameters on
subsequent iterations). This is the Baum-Welch or Forward-backward algorithm,
an example of a Generalized Expectation-Maximization algorithm (Sec. 3.8):


Algorithm 5 (Forward-backward)


1 begin initialize $a_{ij}$, $b_{jk}$, training sequence $V^T$, convergence criterion $\theta$, $z \leftarrow 0$
2     do $z \leftarrow z + 1$
3         Compute $\hat{a}(z)$ from $a(z-1)$ and $b(z-1)$ by Eq. 96
4         Compute $\hat{b}(z)$ from $a(z-1)$ and $b(z-1)$ by Eq. 97
5         $a_{ij}(z) \leftarrow \hat{a}_{ij}(z)$
6         $b_{jk}(z) \leftarrow \hat{b}_{jk}(z)$
7     until $\max_{i,j,k} [a_{ij}(z) - a_{ij}(z-1), b_{jk}(z) - b_{jk}(z-1)] < \theta$ (convergence achieved)
8 return $a_{ij} \leftarrow a_{ij}(z)$; $b_{jk} \leftarrow b_{jk}(z)$
9 end


The stopping or convergence criterion in line ?? halts learning when no estimated
transition probability changes more than a predetermined amount, $\theta$. In typical
speech recognition applications, convergence requires several presentations of each
training sequence (fewer than five is common). Other popular stopping criteria are
based on overall probability that the learned model could have generated the full
training data.
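
One pass of the re-estimation can be sketched as follows. This is a minimal sketch for a
single training sequence, under several assumptions: the function names are illustrative,
the initial and final hidden states are taken as known, the emission update sums
$\gamma_{ij}(t)$ over the source state $i$ (the symbol at step $t$ being emitted by the
destination state), and a row of $a$ or $b$ is left unchanged when its state is never
visited. The matrices are those of the worked evaluation example, with visible symbols
for $t = 1, \ldots, 4$ assumed to be $v_1, v_3, v_2, v_0$.

```python
def forward(a, b, visible, start):
    c = len(a)
    alpha = [[1.0 if j == start else 0.0 for j in range(c)]]
    for symbol in visible:
        prev = alpha[-1]
        alpha.append([b[j][symbol] * sum(prev[i] * a[i][j] for i in range(c))
                      for j in range(c)])
    return alpha

def backward(a, b, visible, final):
    c, T = len(a), len(visible)
    beta = [[0.0] * c for _ in range(T + 1)]
    beta[T][final] = 1.0
    for t in range(T - 1, -1, -1):
        for i in range(c):
            beta[t][i] = sum(a[i][j] * b[j][visible[t]] * beta[t + 1][j]
                             for j in range(c))
    return beta

def reestimate(a, b, visible, start, final):
    """One Forward-backward update of a and b from a single sequence."""
    c, m, T = len(a), len(b[0]), len(visible)
    alpha = forward(a, b, visible, start)
    beta = backward(a, b, visible, final)
    likelihood = alpha[T][final]
    # gamma[t-1][i][j]: probability of the transition i -> j at step t
    gamma = [[[alpha[t - 1][i] * a[i][j] * b[j][visible[t - 1]] * beta[t][j]
               / likelihood for j in range(c)] for i in range(c)]
             for t in range(1, T + 1)]
    new_a = [row[:] for row in a]
    new_b = [row[:] for row in b]
    for i in range(c):
        leaving = sum(gamma[t][i][j] for t in range(T) for j in range(c))
        if leaving > 0.0:           # keep the old row for states never left
            new_a[i] = [sum(gamma[t][i][j] for t in range(T)) / leaving
                        for j in range(c)]
    for j in range(c):
        entering = [sum(gamma[t][i][j] for i in range(c)) for t in range(T)]
        total = sum(entering)
        if total > 0.0:             # keep the old row for states never entered
            new_b[j] = [sum(e for e, v in zip(entering, visible) if v == k) / total
                        for k in range(m)]
    return new_a, new_b, likelihood

A = [[1.0, 0.0, 0.0, 0.0],
     [0.2, 0.3, 0.1, 0.4],
     [0.2, 0.5, 0.2, 0.1],
     [0.8, 0.1, 0.0, 0.1]]
B = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.4, 0.1, 0.2],
     [0.0, 0.1, 0.1, 0.7, 0.1],
     [0.0, 0.5, 0.2, 0.1, 0.2]]
VISIBLE = [1, 3, 2, 0]

new_a, new_b, p_before = reestimate(A, B, VISIBLE, start=1, final=0)
_, _, p_after = reestimate(new_a, new_b, VISIBLE, start=1, final=0)
print(p_before, p_after)    # the likelihood should not decrease
```

Calling `reestimate` repeatedly until the largest parameter change falls below a
threshold gives the loop of Algorithm 5; with several training sequences the sums over
$t$ would be accumulated over all of them before the ratios are formed.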


Summary 


If we know a parametric form of the class-conditional probability densities, we can
reduce our learning task from one of finding the distribution itself, to that of finding
the parameters (represented by a vector $\theta_i$ for each category $\omega_i$), and use the
resulting distributions for classification. The maximum likelihood method seeks to
find the parameter value that is best supported by the training data, i.e., maximizes 
the probability of obtaining the samples actually observed. (In practice, for com- 
putational simplicity one typically uses log-likelihood.) In Bayesian estimation the 
parameters are considered random variables having a known a priori density; the 
training data convert this to an a posteriori density. The recursive Bayes method 
updates the Bayesian parameter estimate incrementally, i.e., as each training point 
is sampled. While Bayesian estimation is, in principle, to be preferred, maximum 
likelihood methods are generally easier to implement and in the limit of large training 
sets give classifiers nearly as accurate. 

A sufficient statistic $s$ for $\theta$ is a function of the samples that contains all
information needed to determine $\theta$. Once we know the sufficient statistic for models of a
given form (e.g., exponential family), we need only estimate their value from data to
create our classifier; no other functions of the data are relevant.

Expectation-Maximization is an iterative scheme to maximize the likelihood of model parameters,
even when some data are missing. Each iteration employs two steps: the expectation 
or E step which requires marginalizing over the missing variables given the current 
model, and the maximization or M step, in which the optimum parameters of a new 
model are chosen. Generalized Expectation-Maximization algorithms demand merely 
that parameters be improved — not optimized — on each iteration and have been 
applied to the training of a large range of models. 

Bayesian belief nets allow the designer to specify, by means of connection topology, 
the functional dependences and independencies among model variables. When any 
subset of variables is clamped to some known values, each node comes to a proba- 
bility of its value through a Bayesian inference calculation. Parameters representing 
conditional dependences can be set by an expert. 

Hidden Markov models consist of nodes representing hidden states, interconnected
by links describing the conditional probabilities of a transition between the states.
Each hidden state also has an associated set of probabilities of emitting a particular
visible state. HMMs can be useful in modelling sequences, particularly context-dependent
ones, such as phonemes in speech. All the transition probabilities can be learned
(estimated) iteratively from sample sequences by means of the Forward-backward or
Baum-Welch algorithm, an example of a generalized EM algorithm. Classification
proceeds by finding the single model among candidates that is most likely to have
produced a given observed sequence.


Bibliographical and Historical Remarks 


Maximum likelihood and Bayes estimation have a long history. The Bayesian ap- 
proach to learning in pattern recognition began by the suggestion that the proper 
way to use samples when the conditional densities are unknown is the calculation 
of P(w;|x,D), [6]. Bayes himself appreciated the role of non-informative priors. An 
analysis of different priors from statistics appears in [21, 15] and [4] has an extensive 
list of references. 

The origins of Bayesian belief nets can be traced back to [33], and a thorough literature
review can be found in [8]; excellent modern books such as [24, 16] and tutorials [7] 
can be recommended. An important dissertation on the theory of belief nets, with 
an application to medical diagnosis is [14], and a summary of work on diagnosis of 
machine faults is [13]. While we have focussed on directed acyclic graphs, belief nets 
are of broader use, and even allow loops or arbitrary topologies — a topic that would 
lead us far afield here, but which is treated in [16]. 

The Expectation-Maximization algorithm is due to Dempster et al.[11] and a thor- 
ough overview and history appears in [23]. On-line or incremental versions of EM are 
described in [17, 31]. The definitive compendium of work on missing data, including 
much beyond our discussion here, is [27]. 

Markov developed what later became called the Markov framework [22] in order
to analyze the text of his fellow Russian Pushkin’s masterpiece Eugene Onegin.
Hidden Markov models were introduced by Baum and collaborators [2, 3], and have
had their greatest applications in speech recognition [25, 26], and to a lesser extent
statistical language learning [9], and sequence identification, such as in DNA sequences
[20, 1]. Hidden Markov methods have been extended to two dimensions and applied
to recognizing characters in optical document images [19]. The decoding algorithm is 
related to pioneering work of Viterbi and followers [32, 12]. The relationship between 
hidden Markov models and graphical models such as Bayesian belief nets is explored 
in [29]. 

Knuth’s classic [18] was the earliest compendium of the central results on com- 
putational complexity, the majority due to himself. The standard books [10], which 
inspired several homework problems below, are a bit more accessible for those with- 
out deep backgrounds in computer science. Finally, several other pattern recognition 
textbooks, such as [28, 5, 30] which take a somewhat different approach to the field 
can be recommended. 


Problems 


Section 3.2


1. Let $x$ have an exponential density

$$p(x|\theta) = \begin{cases} \theta e^{-\theta x} & x \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Plot $p(x|\theta)$ versus $x$ for $\theta = 1$. Plot $p(x|\theta)$ versus $\theta$, $(0 \leq \theta \leq 5)$, for $x = 2$.

(b) Suppose that $n$ samples $x_1, \ldots, x_n$ are drawn independently according to $p(x|\theta)$.
Show that the maximum likelihood estimate for $\theta$ is given by

$$\hat{\theta} = \frac{1}{\frac{1}{n} \sum_{k=1}^{n} x_k}.$$

(c) On your graph generated with $\theta = 1$ in part (a), mark the maximum likelihood
estimate $\hat{\theta}$ for large $n$.
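
A numerical sanity check (not a proof) of the estimate in part (b) can be sketched by
simulation. The function name `exponential_mle`, the seed, and the sample size are
arbitrary choices made for illustration.

```python
import random

def exponential_mle(samples):
    """theta_hat = 1 / (sample mean) for p(x|theta) = theta * exp(-theta * x)."""
    return len(samples) / sum(samples)

rng = random.Random(0)                  # fixed seed so the run is repeatable
true_theta = 1.0
samples = [rng.expovariate(true_theta) for _ in range(10_000)]
print(exponential_mle(samples))         # should land close to true_theta
```

For large $n$ the printed value settles near the true $\theta$, which is the behavior
part (c) asks you to mark on the graph.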


2. Let $x$ have a uniform density

$$p(x|\theta) \sim U(0, \theta) = \begin{cases} 1/\theta & 0 \leq x \leq \theta \\ 0 & \text{otherwise.} \end{cases}$$

(a) Suppose that $n$ samples $D = \{x_1, \ldots, x_n\}$ are drawn independently according to
$p(x|\theta)$. Show that the maximum likelihood estimate for $\theta$ is $\max[D]$, i.e., the
value of the maximum element in $D$.

(b) Suppose that $n = 5$ points are drawn from the distribution and the maximum
value of which happens to be $\max_k x_k = 0.6$. Plot the likelihood $p(D|\theta)$ in the
range $0 \leq \theta \leq 1$. Explain in words why you do not need to know the values of
the other four points.
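
As a starting point for the plot in part (b), the likelihood can be tabulated on a grid.
This is only a sketch: the function name `uniform_likelihood` and the grid spacing are
assumptions, and the likelihood is written in the form $\theta^{-n}$ for $\theta \geq \max_k x_k$
and zero otherwise, which follows from the density above.

```python
def uniform_likelihood(theta, n, x_max):
    """p(D|theta) for n samples from U(0, theta) whose largest value is x_max."""
    return theta ** (-n) if theta >= x_max else 0.0

grid = [i / 100 for i in range(1, 101)]             # theta = 0.01, 0.02, ..., 1.00
values = [uniform_likelihood(th, n=5, x_max=0.6) for th in grid]
best = grid[values.index(max(values))]
print(best)                                         # grid point with the largest likelihood
```

The function takes only `n` and `x_max` as inputs about the data, which is a hint toward
the explanation the problem asks for.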


3. Maximum likelihood methods apply to estimates of prior probabilities as well.
Let samples be drawn by successive, independent selections of a state of nature $\omega_i$
with unknown probability $P(\omega_i)$. Let $z_{ik} = 1$ if the state of nature for the $k$th sample
is $\omega_i$ and $z_{ik} = 0$ otherwise.


(a) Show that

$$P(z_{i1}, \ldots, z_{in} | P(\omega_i)) = \prod_{k=1}^{n} P(\omega_i)^{z_{ik}} (1 - P(\omega_i))^{1 - z_{ik}}.$$

Interpret your result in words.


4. Let $x$ be a $d$-dimensional binary (0 or 1) vector with a multivariate Bernoulli
distribution

$$P(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{1 - x_i},$$

where $\theta = (\theta_1, \ldots, \theta_d)^t$ is an unknown parameter vector, $\theta_i$ being the probability that
$x_i = 1$. Show that the maximum likelihood estimate for $\theta$ is

$$\hat{\theta} = \frac{1}{n} \sum_{k=1}^{n} x_k.$$

5. Let each component $x_i$ of $x$ be binary valued (0 or 1) in a two-category problem
with $P(\omega_1) = P(\omega_2) = 0.5$. Suppose that the probability of obtaining a 1 in any
component is

$$p_{i1} = p$$
$$p_{i2} = 1 - p,$$

and we assume for definiteness $p > 1/2$. The probability of error is known to approach
zero as the dimensionality $d$ approaches infinity. This problem asks you to explore the
behavior as we increase the number of features in a single sample, a complementary
situation.


(a) Suppose that a single sample $x = (x_1, \ldots, x_d)^t$ is drawn from category $\omega_1$. Show
that the maximum likelihood estimate for $p$ is given by

$$\hat{p} = \frac{1}{d} \sum_{i=1}^{d} x_i.$$

(b) Describe the behavior of $\hat{p}$ as $d$ approaches infinity. Indicate why such behavior
means that by letting the number of features increase without limit we can
obtain an error-free classifier even though we have only one sample from each
class.


(c) Let $T = \frac{1}{d} \sum_{j=1}^{d} x_j$ represent the proportion of 1's in a single sample. Plot
$P(T|\omega_i)$ vs. $T$ for the case $P = 0.6$, for small $d$ and for large $d$ (e.g., $d = 11$ and
$d = 111$, respectively). Explain your answer in words.


6. Derive Eqs. 18 & 19 for the maximum likelihood estimation of the mean and 
covariance of a multidimensional Gaussian. State clearly any assumptions you need 
to invoke. 

7. Show that if our model is poor, the maximum likelihood classifier we derive
is not the best, even among our (poor) model set, by exploring the following
example. Suppose we have two equally probable categories (i.e., $P(\omega_1) = P(\omega_2) =
0.5$). Further, we know that $p(x|\omega_1) \sim N(0, 1)$ but assume that $p(x|\omega_2) \sim N(\mu, 1)$.
(That is, the parameter $\theta$ we seek by maximum likelihood techniques is the mean of
the second distribution.) Imagine however that the true underlying distribution is
$p(x|\omega_2) \sim N(1, 10^6)$.


(a) What is the value of our maximum likelihood estimate $\hat{\mu}$ in our poor model,
given a large amount of data?


(b) What is the decision boundary arising from this maximum likelihood estimate
in the poor model?


(c) Ignore for the moment the maximum likelihood approach, and use the methods
from Chap. ?? to derive the Bayes optimal decision boundary given the true
underlying distributions, $p(x|\omega_1) \sim N(0, 1)$ and $p(x|\omega_2) \sim N(1, 10^6)$. Be
careful to include all portions of the decision boundary.


(d) Now consider again classifiers based on the (poor) model assumption of $p(x|\omega_2) \sim N(\mu, 1)$.
Using your result immediately above, find a new value of $\mu$ that will give lower
error than the maximum likelihood classifier.

(e) Discuss these results, with particular attention to the role of knowledge of the
underlying model.


8. Consider an extreme case of the general issue discussed in Problem 7, one in
which it is possible that the maximum likelihood solution leads to the worst possible
classifier, i.e., one with an error that approaches 100% (in probability). Suppose our
data in fact comes from two one-dimensional distributions of the forms

$$p(x|\omega_1) \sim [(1 - k)\delta(x - 1) + k\delta(x + X)] \quad \text{and}$$
$$p(x|\omega_2) \sim [(1 - k)\delta(x + 1) + k\delta(x - X)],$$

where $X$ is positive, $0 \leq k < 0.5$ represents the portion of the total probability mass
concentrated at the point $\pm X$, and $\delta(\cdot)$ is the Dirac delta function. Suppose our poor
models are of the form $p(x|\omega_1, \mu_1) \sim N(\mu_1, \sigma_1^2)$ and $p(x|\omega_2, \mu_2) \sim N(\mu_2, \sigma_2^2)$ and we
form a maximum likelihood classifier.


(a) Consider the symmetries in the problem and show that in the infinite data case
the decision boundary will always be at $x = 0$, regardless of $k$ and $X$.


(b) Recall that the maximum likelihood estimate of either mean, $\hat{\mu}_i$, is the mean
of its distribution. For a fixed $k$, find the value of $X$ such that the maximum
likelihood estimates of the means “switch,” i.e., where $\hat{\mu}_1 \geq \hat{\mu}_2$.


(c) Plot the true distributions and the Gaussian estimates for the particular case
$k = .2$ and $X = 5$. What is the classification error in this case?


(d) Find a dependence $X(k)$ which will guarantee that the estimated mean $\hat{\mu}_1$ of
$p(x|\omega_1)$ is less than zero. (By symmetry, this will also ensure $\hat{\mu}_2 > 0$.)


(e) Given your $X(k)$ just derived, state the classification error in terms of $k$.


(f) Suppose we constrained our model space such that $\sigma_1^2 = \sigma_2^2 = 1$ (or indeed any
other constant). Would that change the above results?


(g) Discuss how if our model is wrong (here, does not include the delta functions),
the error can approach 100% (in probability). Does this surprising answer
arise because we have found some local minimum in parameter space?


9. Prove the invariance property of maximum likelihood estimators, i.e., that if $\hat{\theta}$ is
the maximum likelihood estimate of $\theta$, then for any differentiable function $\tau(\cdot)$, the
maximum likelihood estimate of $\tau(\theta)$ is $\tau(\hat{\theta})$.

10. Suppose we employ a novel method for estimating the mean of a data set
$D = \{x_1, x_2, \ldots, x_n\}$: we assign the mean to be the value of the first point in the set,
i.e., $x_1$.
(a) Show that this method is unbiased. 
(b) State why this method is nevertheless highly undesirable. 


11. One measure of the difference between two distributions in the same space is the
Kullback-Leibler divergence or Kullback-Leibler “distance”:

$$D_{KL}(p_1(x), p_2(x)) = \int p_1(x) \ln \frac{p_1(x)}{p_2(x)} \, dx.$$

(This “distance” does not obey the requisite symmetry and triangle inequalities for a
metric.) Suppose we seek to approximate an arbitrary distribution $p_2(x)$ by a normal
$p_1(x) \sim N(\mu, \Sigma)$. Show that the values that lead to the smallest Kullback-Leibler
divergence are the obvious ones:

$$\mu = \mathcal{E}_2[x]$$
$$\Sigma = \mathcal{E}_2[(x - \mu)(x - \mu)^t],$$

where the expectation taken is over the density $p_2(x)$.


Section 3.3


12. Justify all the statements in the text leading from Eq. 25 to Eq. 26.

Section 3.4


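13. Let $p(x|\Sigma) \sim N(\mu, \Sigma)$ where $\mu$ is known and $\Sigma$ is unknown. Show that the
maximum likelihood estimate for $\Sigma$ is given by

$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^t$$

by carrying out the following argument:


(a) Prove the matrix identity $a^t A a = \mathrm{tr}[A a a^t]$, where the trace, $\mathrm{tr}[A]$, is the sum
of the diagonal elements of $A$.


(b) Show that the likelihood function can be written in the form

$$p(x_1, \ldots, x_n | \Sigma) = \frac{1}{(2\pi)^{nd/2}} |\Sigma^{-1}|^{n/2} \exp\left[ -\frac{1}{2} \mathrm{tr}\left[ \Sigma^{-1} \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^t \right] \right].$$

(c) Let $A = \Sigma^{-1} \hat{\Sigma}$ and $\lambda_1, \ldots, \lambda_d$ be the eigenvalues of $A$; show that your result
above leads to

$$p(x_1, \ldots, x_n | \Sigma) = \frac{1}{(2\pi)^{nd/2} |\hat{\Sigma}|^{n/2}} (\lambda_1 \cdots \lambda_d)^{n/2} \exp\left[ -\frac{n}{2} (\lambda_1 + \cdots + \lambda_d) \right].$$

(d) Complete the proof by showing that the likelihood is maximized by the choice
$\lambda_1 = \cdots = \lambda_d = 1$. Explain your reasoning.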


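14. Suppose that $p(x|\mu_i, \Sigma, \omega_i) \sim N(\mu_i, \Sigma)$, where $\Sigma$ is a common covariance matrix
for all $c$ classes. Let $n$ samples $x_1, \ldots, x_n$ be drawn as usual, and let $l_1, \ldots, l_n$ be their
labels, so that $l_k = i$ if the state of nature for $x_k$ was $\omega_i$.


(a) Show that

$$p(x_1, \ldots, x_n, l_1, \ldots, l_n | \mu_1, \ldots, \mu_c, \Sigma) = \frac{\prod_{k=1}^{n} P(\omega_{l_k})}{(2\pi)^{nd/2} |\Sigma|^{n/2}} \exp\left[ -\frac{1}{2} \sum_{k=1}^{n} (x_k - \mu_{l_k})^t \Sigma^{-1} (x_k - \mu_{l_k}) \right].$$

(b) Using the results for samples drawn from a single normal population, show that
the maximum likelihood estimates for $\mu_i$ and $\Sigma$ are given by

$$\hat{\mu}_i = \frac{\sum_{l_k = i} x_k}{\sum_{l_k = i} 1}$$

and

$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu}_{l_k})(x_k - \hat{\mu}_{l_k})^t.$$

Interpret your answer in words.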
15. Consider the problem of learning the mean of a univariate normal distribution.
Let $n_0 = \sigma^2 / \sigma_0^2$ be the dogmatism, and imagine that $\mu_0$ is formed by averaging $n_0$
fictitious samples $x_k$, $k = -n_0 + 1, -n_0 + 2, \ldots, 0$.


(a) Show that Eqs. 32 & 33 for $\mu_n$ and $\sigma_n^2$ yield

$$\mu_n = \frac{1}{n + n_0} \sum_{k=-n_0+1}^{n} x_k$$

and

$$\sigma_n^2 = \frac{\sigma^2}{n + n_0}.$$

(b) Use this result to give an interpretation of the a priori density $p(\mu) \sim N(\mu_0, \sigma_0^2)$.
16. Suppose that $A$ and $B$ are nonsingular matrices of the same order.


(a) Prove the matrix identity

$$(A^{-1} + B^{-1})^{-1} = A(A + B)^{-1}B = B(A + B)^{-1}A.$$

(b) Must these matrices be square for this identity to hold?


(c) Use this result in showing that Eqs. 46 & 47 do indeed follow from Eqs. 42 &
43.

17. The purpose of this problem is to derive the Bayesian classifier for the
$d$-dimensional multivariate Bernoulli case. As usual, work with each class separately,
interpreting $P(x|D)$ to mean $P(x|D_i, \omega_i)$. Let the conditional probability for a given
category be given by

$$P(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{1 - x_i},$$

and let $D = \{x_1, \ldots, x_n\}$ be a set of $n$ samples independently drawn according to this
probability density.

(a) If $s = (s_1, \ldots, s_d)^t$ is the sum of the $n$ samples, show that

$$P(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1 - \theta_i)^{n - s_i}.$$

(b) Assuming a uniform a priori distribution for $\theta$ and using the identity

$$\int_0^1 \theta^m (1 - \theta)^n \, d\theta = \frac{m! \, n!}{(m + n + 1)!},$$

show that

$$p(\theta|D) = \prod_{i=1}^{d} \frac{(n + 1)!}{s_i! (n - s_i)!} \theta_i^{s_i} (1 - \theta_i)^{n - s_i}.$$

(c) Plot this density for the case $d = 1$, $n = 1$, and for the two resulting possibilities
for $s_1$.


(d) Integrate the product $P(x|\theta) p(\theta|D)$ over $\theta$ to obtain the desired conditional
probability

$$P(x|D) = \prod_{i=1}^{d} \left( \frac{s_i + 1}{n + 2} \right)^{x_i} \left( 1 - \frac{s_i + 1}{n + 2} \right)^{1 - x_i}.$$

(e) If we think of obtaining $P(x|D)$ by substituting an estimate $\hat{\theta}$ for $\theta$ in $P(x|\theta)$,
what is the effective Bayesian estimate for $\theta$?

18. Consider how knowledge of an invariance can guide our creation of a prior in the following case. Suppose we have a binary (0 or 1) variable x, chosen independently with a probability $p(\theta) = p(x = 1)$. Imagine we have observed $\mathcal{D}^n = \{x_1, x_2, \ldots, x_n\}$, and now wish to evaluate the probability that $x_{n+1} = 1$, which we express as a ratio:

$$\frac{P(x_{n+1} = 1|\mathcal{D}^n)}{P(x_{n+1} = 0|\mathcal{D}^n)}.$$


(a) Define $s = x_1 + \cdots + x_n$ and $p(t) = P(x_1 + \cdots + x_{n+1} = t)$. Assume now invariance of exchangeability, i.e., that the samples in any set $\mathcal{D}^n$ could have been selected in an arbitrary order and it would not affect any probabilities. Show how this assumption of exchangeability implies the ratio in question can be written

$$\frac{p(s+1)\big/\binom{n+1}{s+1}}{p(s)\big/\binom{n+1}{s}},$$

where $\binom{n+1}{s} = \frac{(n+1)!}{s!\,(n+1-s)!}$ is the binomial coefficient.


(b) Evaluate this ratio given the assumption $p(s) \simeq p(s+1)$, when n and n − s and s are not too small. Interpret your answer in words.

(c) In the binomial framework, we now seek a prior $p(\theta)$ such that p(s) does not depend upon s, where

$$p(s) = \int_0^1 \binom{n}{s}\, \theta^s (1-\theta)^{n-s}\, p(\theta)\, d\theta.$$

Show that this requirement is satisfied if $p(\theta)$ is uniform, i.e., $p(\theta) \sim U(0,1)$.
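
The claim in part (c) can be checked with exact arithmetic before attempting a proof. The sketch below assumes the standard identity that the integral of θ^a (1 − θ)^b over [0, 1] equals a! b!/(a + b + 1)!; the function names are illustrative and not part of the text.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Exact integral of theta**a * (1 - theta)**b over [0, 1] (assumed identity)."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def marginal_count_probability(s: int, n: int) -> Fraction:
    """p(s) under a uniform prior p(theta) = 1 on [0, 1]."""
    return comb(n, s) * beta_integral(s, n - s)


n = 6
probs = [marginal_count_probability(s, n) for s in range(n + 1)]
for s, p in enumerate(probs):
    print(s, p)  # every value is the same fraction, independent of s
```

With a uniform prior every count s from 0 to n is equally likely, which is one way to read the uniform prior as carrying no preference among outcomes.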


19. Assume we have training data from a Gaussian distribution of known covariance $\boldsymbol{\Sigma}$ but unknown mean $\boldsymbol{\mu}$. Suppose further that this mean itself is random, and characterized by a Gaussian density having mean $\mathbf{m}_0$ and covariance $\boldsymbol{\Sigma}_0$.


(a) What is the MAP estimator for $\boldsymbol{\mu}$?


(b) Suppose we transform our coordinates by a linear transform $\mathbf{x}' = \mathbf{A}\mathbf{x}$, for nonsingular matrix $\mathbf{A}$, and accordingly for other terms. Determine whether your MAP estimator gives the appropriate estimate for the transformed mean $\boldsymbol{\mu}'$. Explain.


20. Suppose for a given class with parameter $\alpha$ the density can be written as:

$$p(x|\alpha) = \frac{1}{\alpha} f\left(\frac{x}{\alpha}\right).$$

In such a case we say that $\alpha$ is a scale parameter. For instance, the standard deviation $\sigma$ is a scale parameter for a one-dimensional Gaussian.


(a) Imagine that we measure $x' = \alpha x$ instead of x, for some constant $\alpha$. Show that the density now can be written as

$$p(x'|\alpha') = \frac{1}{\alpha'} f\left(\frac{x'}{\alpha'}\right).$$

Find $\alpha'$.


(b) Find the non-informative prior for $\alpha'$, written as $p'(\alpha')$. You will need to note that for any interval $\Delta \in (0, \infty)$ the following equation should hold:

$$\int_{\Delta} p(\alpha)\, d\alpha = \int_{\Delta} p'(\alpha')\, d\alpha'.$$
21. State the conditions on $p(\mathbf{x}|\boldsymbol{\theta})$, on $p(\boldsymbol{\theta})$, and on $\mathcal{D}^n$ that ensure that the estimate $p(\boldsymbol{\theta}|\mathcal{D}^n)$ in Eq. 54 converges in the limit $n \to \infty$.

Section 3.6

22. Employ the notation of the chapter and suppose $\mathbf{s}$ is a sufficient statistic for which $p(\boldsymbol{\theta}|\mathbf{s}, \mathcal{D}) = p(\boldsymbol{\theta}|\mathbf{s})$. Assume $p(\boldsymbol{\theta}|\mathbf{s}) \neq 0$ and prove that $p(\mathcal{D}|\mathbf{s}, \boldsymbol{\theta})$ is independent of $\boldsymbol{\theta}$.

23. Using the results given in Table 3.1, show that the maximum likelihood estimate for the parameter $\theta$ of a Rayleigh distribution is given by

$$\hat{\theta} = \frac{1}{\frac{1}{n}\sum_{k=1}^{n} x_k^2}.$$

24. Using the results given in Table 3.1, show that the maximum likelihood estimate for the parameter $\theta$ of a Maxwell distribution is given by

$$\hat{\theta} = \frac{3/2}{\frac{1}{n}\sum_{k=1}^{n} x_k^2}.$$


25. Using the results given in Table 3.1, show that the maximum likelihood estimate for the parameter $\boldsymbol{\theta}$ of a multinomial distribution is given by


where the vector $\mathbf{s} = (s_1, \ldots, s_d)^t$ is the average of the n samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$.


26. Demonstrate that sufficiency is an integral concept, i.e., that if $\mathbf{s}$ is sufficient for $\boldsymbol{\theta}$, then corresponding components of $\mathbf{s}$ and $\boldsymbol{\theta}$ need not be sufficient. Do this for the case of a univariate Gaussian $p(x) \sim N(\mu, \sigma^2)$ where $\boldsymbol{\theta} = (\mu, \sigma^2)^t$ is the full vector of parameters.


(a) Verify that the statistic


is indeed sufficient for $\boldsymbol{\theta}$, as given in Table 3.1.


(b) Show that $s_1$ taken alone is not sufficient for $\mu$. Does your answer depend upon whether $\sigma^2$ is known?


(c) Show that $s_2$ taken alone is not sufficient for $\sigma^2$. Does your answer depend upon whether $\mu$ is known?


27. Suppose $\mathbf{s}$ is a statistic for which $p(\boldsymbol{\theta}|\mathbf{x}, \mathcal{D}) = p(\boldsymbol{\theta}|\mathbf{s})$.

(a) Assume $p(\boldsymbol{\theta}|\mathbf{s}) \neq 0$, and prove that $p(\mathcal{D}|\mathbf{s}, \boldsymbol{\theta})$ is independent of $\boldsymbol{\theta}$.


(b) Create an example to show that the inequality $p(\boldsymbol{\theta}|\mathbf{s}) \neq 0$ is required for your proof above.


28. Consider the Cauchy distribution,

$$p(x) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x-a}{b}\right)^2},$$

for b > 0 and arbitrary real a.

(a) Confirm that the distribution is indeed normalized.


(b) For a fixed a and b, try to calculate the mean and the standard deviation of the distribution. Explain your results.


(c) Prove that this distribution has no sufficient statistics for the mean and standard deviation.

29. In the following, suppose a and b are constants and n a variable parameter.
(a) Is $a^{n+1} = O(a^n)$?
(b) Is $a^{bn} = O(a^n)$?
(c) Is $a^{n+b} = O(a^n)$?
(d) Prove $f(n) = O(f(n))$.


30. Consider the evaluation of a polynomial function $f(x) = \sum_{i=0}^{n-1} a_i x^i$, where the n coefficients $a_i$ are given.


(a) Write pseudocode for a simple $O(n^2)$-time algorithm for evaluating f(x).


(b) Show that such a polynomial can be rewritten as:

$$f(x) = \sum_{i=0}^{n-1} a_i x^i = (\cdots(a_{n-1}x + a_{n-2})x + \cdots + a_1)x + a_0,$$

and so forth, a method known as Horner’s rule. Use the rule to write pseudocode for an O(n)-time algorithm for evaluating f(x).
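
As a rough illustration of the contrast this problem draws, the two evaluation strategies can be sketched in Python. The function names are hypothetical, and the naive version deliberately rebuilds each power from scratch so that its multiplication count grows quadratically.

```python
def evaluate_naive(coeffs, x):
    """O(n^2): rebuild x**i by repeated multiplication for every term."""
    total = 0
    for i, a in enumerate(coeffs):   # coeffs[i] is a_i
        power = 1
        for _ in range(i):           # i multiplications for the i-th term
            power *= x
        total += a * power
    return total


def evaluate_horner(coeffs, x):
    """O(n): one multiplication and one addition per coefficient."""
    result = 0
    for a in reversed(coeffs):       # start from a_{n-1} and work inwards
        result = result * x + a
    return result


coeffs = [1, -3, 0, 2]               # f(x) = 1 - 3x + 2x^3
print(evaluate_naive(coeffs, 2), evaluate_horner(coeffs, 2))
```

Both functions return the same value for any input; only the operation count differs.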


31. For each of the short procedures, state the computational complexity in terms 
of the variables N, M, P, and K, as appropriate. Assume that all data structures 
are defined, that those without indexes are scalars and that those with indexes have 
the number of dimensions shown. 


Algorithm 6

begin for i ← i + 1
          s ← s + i^3
      until i = N
      return s
end


Algorithm 7

begin for i ← i + 1
          s ← s + x_i × x_i
      until i = N
      return √s
end

Algorithm 8

begin for j ← j + 1
          for i ← i + 1
              s_j ← s_j + w_ij x_i
          until i = I
      until j = J
      for k ← k + 1
          for j ← j + 1
              r_k ← r_k + w_jk s_j
          until j = J
      until k = K
end


32. Consider a computer having a uniprocessor that can perform one operation per nanosecond ($10^{-9}$ sec). The left column of the table shows the functional dependence of such operations in different hypothetical algorithms. For each such function, fill in the number of operations n that can be performed in the total time listed along the top.


f(n) | 1 sec | 1 hour | 1 day | 1 year


33. Show that the estimator of Eq. 21 is indeed unbiased for: 

(a) Normal distributions. 

(b) Cauchy distributions. 

(c) Binomial distributions. 

(d) Prove that the estimator of Eq. 20 is asymptotically unbiased. 


34. Let the sample mean $\hat{\boldsymbol{\mu}}_n$ and the sample covariance matrix $\mathbf{C}_n$ for a set of n samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ (each of which is d-dimensional) be defined by

$$\hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$$

and

$$\mathbf{C}_n = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_n)(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_n)^t.$$

We call these the “non-recursive” formulas.

(a) What is the computational complexity of calculating $\hat{\boldsymbol{\mu}}_n$ and $\mathbf{C}_n$ by these formulas?


(b) Show that alternative, “recursive” techniques for calculating $\hat{\boldsymbol{\mu}}_n$ and $\mathbf{C}_n$ based on the successive addition of new samples $\mathbf{x}_{n+1}$ can be derived using the recursion relations

$$\hat{\boldsymbol{\mu}}_{n+1} = \hat{\boldsymbol{\mu}}_n + \frac{1}{n+1}(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)$$

and

$$\mathbf{C}_{n+1} = \frac{n-1}{n}\mathbf{C}_n + \frac{1}{n+1}(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)^t.$$


(c) What is the computational complexity of finding $\hat{\boldsymbol{\mu}}_n$ and $\mathbf{C}_n$ by these recursive methods?


(d) Describe situations where you might prefer to use the recursive method for computing $\hat{\boldsymbol{\mu}}_n$ and $\mathbf{C}_n$, and ones where you might prefer the non-recursive method.
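
A numerical check can make the recursion in this problem concrete before it is derived. The following sketch uses plain Python lists, hypothetical function names, and made-up two-dimensional samples; it assumes the recursion is started from a single sample with a zero covariance matrix, which is harmless because the factor (n − 1)/n vanishes at n = 1.

```python
def batch_mean_cov(samples):
    """Non-recursive sample mean and covariance (1/(n-1) normalization)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [x[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    return mean, cov


def recursive_update(mean, cov, n, x_new):
    """Fold sample number n+1 into the mean and covariance of the first n."""
    d = len(mean)
    diff = [x_new[j] - mean[j] for j in range(d)]    # uses the OLD mean
    new_mean = [mean[j] + diff[j] / (n + 1) for j in range(d)]
    new_cov = [[(n - 1) / n * cov[i][j] + diff[i] * diff[j] / (n + 1)
                for j in range(d)] for i in range(d)]
    return new_mean, new_cov


# Hypothetical data, chosen only for illustration.
samples = [[0.5, 1.0], [1.5, -0.5], [2.0, 2.0], [-1.0, 0.0], [0.0, 3.0]]

mean, cov, n = list(samples[0]), [[0.0, 0.0], [0.0, 0.0]], 1
for x in samples[1:]:
    mean, cov = recursive_update(mean, cov, n, x)
    n += 1

batch_mean, batch_cov = batch_mean_cov(samples)
print(mean, batch_mean)
print(cov, batch_cov)
```

The recursive version touches each new sample once and never revisits old ones, which is the property parts (c) and (d) ask about.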


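35. In pattern classification, one is often interested in the inverse of the covariance matrix, for instance when designing a Bayes classifier for Gaussian distributions. Note that the non-recursive calculation of $\mathbf{C}_n^{-1}$ (the inverse of the covariance matrix based on n samples, cf., Problem 34) might require the $O(n^3)$ inversion of $\mathbf{C}_n$ by standard matrix methods. We now explore an alternative, “recursive” method for computing $\mathbf{C}_{n+1}^{-1}$.

(a) Prove the so-called Sherman-Morrison-Woodbury matrix identity

$$(\mathbf{A} + \mathbf{x}\mathbf{y}^t)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{x}\mathbf{y}^t\mathbf{A}^{-1}}{1 + \mathbf{y}^t\mathbf{A}^{-1}\mathbf{x}}.$$


(b) Use this and the results of Problem 34 to show that

$$\mathbf{C}_{n+1}^{-1} = \frac{n}{n-1}\left[\mathbf{C}_n^{-1} - \frac{\mathbf{C}_n^{-1}(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)^t\mathbf{C}_n^{-1}}{\frac{n^2-1}{n} + (\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)^t\mathbf{C}_n^{-1}(\mathbf{x}_{n+1} - \hat{\boldsymbol{\mu}}_n)}\right].$$


(c) What is the computational complexity of this calculation?


(d) Describe situations where you would use the recursive method, and ones where you would use instead the non-recursive method.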


36. Suppose we wish to simplify (or regularize) a Gaussian classifier for two categories by means of shrinkage. Suppose that the estimated distributions are $N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. In order to employ shrinkage of an assumed common covariance toward the identity matrix as given in Eq. 77, show that one must first normalize the data to have unit variance.


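Section 3.8

37. Consider the convergence of the Expectation-Maximization algorithm, i.e., that if $l(\boldsymbol{\theta}; \mathcal{D}_g) = \ln p(\mathcal{D}_g; \boldsymbol{\theta})$ is not already optimum, then the EM algorithm increases it. Prove this as follows:

(a) First note that

$$l(\boldsymbol{\theta}; \mathcal{D}_g) = \ln p(\mathcal{D}_g, \mathcal{D}_b; \boldsymbol{\theta}) - \ln p(\mathcal{D}_b|\mathcal{D}_g; \boldsymbol{\theta}).$$

Let $\mathcal{E}'[\cdot]$ denote the expectation with respect to the distribution $p(\mathcal{D}_b|\mathcal{D}_g; \boldsymbol{\theta}')$. Take such an expectation of $l(\boldsymbol{\theta}; \mathcal{D}_g)$, and express your answer in terms of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ of Eq. 78.


(b) Define $\phi(\mathcal{D}_b) = p(\mathcal{D}_b|\mathcal{D}_g; \boldsymbol{\theta})/p(\mathcal{D}_b|\mathcal{D}_g; \boldsymbol{\theta}')$ to be the ratio of expectations assuming the two distributions. Show that $\mathcal{E}'[\ln \phi(\mathcal{D}_b)] \le \mathcal{E}'[\phi(\mathcal{D}_b)] - 1 = 0$.


(c) Use this result to show that if $Q(\boldsymbol{\theta}^{t+1}, \boldsymbol{\theta}^t) > Q(\boldsymbol{\theta}^t, \boldsymbol{\theta}^t)$, achieved by the M step in Algorithm ??, then $l(\boldsymbol{\theta}^{t+1}; \mathcal{D}_g) > l(\boldsymbol{\theta}^t; \mathcal{D}_g)$.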


38. Suppose we seek to estimate $\boldsymbol{\theta}$ describing a multidimensional distribution from data $\mathcal{D}$, some of whose points are missing features. Consider an iterative algorithm in which the maximum likelihood value of the missing values is calculated, then assumed to be correct for the purposes of re-estimating $\boldsymbol{\theta}$, and iterated.


(a) Is this always equivalent to an Expectation-Maximization algorithm, or just a generalized Expectation-Maximization algorithm?


(b) If it is an Expectation-Maximization algorithm, what is $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^0)$, as described by Eq. 78?


39. Consider data D = es È), È), ($), eae sampled from a two-dimensional 
uniform distribution 


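$$p(\mathbf{x}) \sim U(\mathbf{x}_l, \mathbf{x}_u) = \begin{cases} \dfrac{1}{|x_{u1} - x_{l1}|\,|x_{u2} - x_{l2}|} & \text{if } x_{l1} \le x_1 \le x_{u1} \text{ and } x_{l2} \le x_2 \le x_{u2} \\ 0 & \text{otherwise,} \end{cases}$$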


where * represents missing feature values. 


(a) Start with an initial estimate 


and analytically calculate $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^0)$, the E step in the EM algorithm.
(b) Find the $\boldsymbol{\theta}$ that maximizes your $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^0)$, the M step.
(c) Plot your data and the bounding rectangle.


(d) Without having to iterate further, state the estimate of $\boldsymbol{\theta}$ that would result after convergence of the EM algorithm.


40. Consider data D = eae EP (yh sampled from a two-dimensional (separable) 
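distribution $p(x_1, x_2) = p(x_1)p(x_2)$, with

$$p(x_1) \sim \begin{cases} \dfrac{1}{\theta_1} e^{-\theta_1 x_1} & \text{if } x_1 \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$

and

$$p(x_2) \sim U(0, \theta_2) = \begin{cases} \dfrac{1}{\theta_2} & \text{if } 0 \le x_2 \le \theta_2 \\ 0 & \text{otherwise.} \end{cases}$$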


As usual, * represents a missing feature value. 


(a) Start with an initial estimate 0% = £) and analytically calculate Q(0,0%) — 
the E step in the EM algorithm. Be sure to consider the normalization of your distribution.

(b) Find the $\boldsymbol{\theta}$ that maximizes your $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^0)$, the M step.


(c) Plot your data on a two-dimensional graph and indicate the new parameter 
estimates. 


41. Repeat Problem 40 but with data D = { (}), Els (S) y 
Section 3.9


42. Use the conditional probability matrices in Example 3 to answer the following 
separate problems. 


(a) Suppose it is December 20 (the end of autumn and the beginning of winter), and thus let $P(a_1) = P(a_4) = 0.5$. Furthermore, it is known that the fish was caught in the north Atlantic, i.e., $P(b_1) = 1$. Suppose the lightness has not been measured but it is known that the fish is thin, i.e., $P(d_2) = 1$. Classify the fish as salmon or sea bass. What is the expected error rate?


(b) Suppose all we know is that a fish is thin and medium lightness. What season is it now, most likely? What is your probability of being correct?


(c) Suppose we know a fish is thin and medium lightness and that it was caught in the north Atlantic. What season is it, most likely? What is the probability of being correct?


43. One of the simplest assumptions is that of the naive Bayes rule or idiot Bayes rule expressed in Eq. 85. Draw the belief net for a three-category problem with five features $x_i$, $i = 1, 2, \ldots, 5$.

44. Consider a Bayesian belief net with several nodes having unspecified values. 
Suppose that one such node is selected at random, the probabilities of its nodes 
computed by the formulas described in the text. Next another such node is chosen at 
random (possibly even a node already visited), and the probabilities similarly updated. 
Prove that this procedure will converge to the desired probabilities throughout the 
full network. 
45. Consider training an HMM by the Forward-backward algorithm, for a single sequence of length T where each symbol could be one of c values. What is the computational complexity of a single revision of all values $a_{ij}$ and $b_{jk}$?

46. The standard method for calculating the probability of a sequence in a given HMM is to use the forward probabilities $\alpha_i(t)$.

(a) Show by a simple substitution that a symmetric method can be derived using the backward probabilities $\beta_i(t)$.


(b) Prove that one can get the probability by combining the forward and the backward probabilities at any place in the middle of the sequence. That is, show that

$$P(\omega^T) = \sum_{i=1}^{T'} \alpha_i(t)\beta_i(t),$$

where $\omega^T$ is a particular sequence of length $T' < T$.


(c) Show that your formula reduces to the known values at the beginning and end of the sequence.
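
The property behind this problem can be observed numerically. The sketch below is not the chapter's exact formulation: it assumes the common convention of an initial state distribution `pi`, a transition matrix `A`, and an emission matrix `B` indexed as `B[state][symbol]`, and all numbers are made up for illustration. Under that convention, the sum over hidden states of the product of forward and backward probabilities gives the same sequence probability at every time step.

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(obs[0..t], state at time t is i)."""
    c = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(c)]]
    for v in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(c)) * B[j][v]
                      for j in range(c)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(obs[t+1..] | state at time t is i)."""
    c = len(A)
    beta = [[1.0] * c]                      # beta at the final time step
    for v in reversed(obs[1:]):             # fill in earlier time steps
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][v] * nxt[j] for j in range(c))
                        for i in range(c)])
    return beta


# Hypothetical two-state, three-symbol model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2]

alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
totals = [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(len(obs))]
print(totals)  # the same sequence probability at every t
```

At the last time step the backward terms are all one, so the total reduces to the usual forward-only sum, which is the boundary case part (c) asks about.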


47. Suppose we have a large number of symbol sequences emitted from an HMM that has a particular transition probability $a_{i'j'} = 0$ for some single value of $i'$ and $j'$. We use such sequences to train a new HMM, one that happens also to start with its $a_{i'j'} = 0$. Prove that this parameter will remain 0 throughout training by the Forward-backward algorithm. In other words, if the topology of the trained model (pattern of non-zero connections) matches that of the generating HMM, it will remain so after training.


48. Consider the decoding algorithm (Algorithm 4) in the text. 


(a) Take logarithms of HMM model parameters and write pseudocode for an equiv- 
alent algorithm. 


(b) Explain why taking logarithms is an O(n) calculation, and thus the complexity of your algorithm in (a) is $O(c^2T)$.


49. Explore the close relationship between Bayesian belief nets and hidden Markov 
models as follows. 


(a) Prove that the forward and the backward equations for hidden Markov models 
are special cases of Eq. 84. 


(b) Use your answer to explain the relationship between these two general classes 


of models. 


Computer exercises 


Several exercises will make use of the following three-dimensional data sampled from three categories, denoted $\omega_i$.

      |          ω1           |           ω2            |           ω3
point |   x1     x2     x3    |   x1      x2     x3     |   x1     x2     x3
  1   |  0.42  -0.087   0.58  |  -0.4     0.58   0.089  |  0.83    1.6   -0.014
  2   | -0.2   -3.3    -3.4   |  -0.31    0.27  -0.04   |  1.1     1.6    0.48
  3   |  1.3   -0.32   17     |   0.38    0.055 -0.035  | -0.44   -0.41   0.32
  4   |  0.39   0.71    0.23  |  -0.15    0.53   0.011  |  0.047  -0.45   1.4
  5   | -1.6   -5.3    -0.15  |  -0.35    0.47   0.034  |  0.28    0.35   3.1
  6   |                -4.7   |   0.17    0.69   0.1    | -0.39   -0.48   0.11
  7   | -0.23  19       2.2   |  -0.011   0.55  -0.18   |  0.34   -0.079  0.14
  8   |  0.27  -0.3    -0.87  |  -0.27    0.61   0.12   | -0.3    -0.22   2.2
  9   | -1.9    0.76   -2.1   |  -0.065   0.49   0.0012 |  1.1     1.2   -0.46
 10   |                -2.6   |  -0.12    0.054 -0.063  |  0.18   -0.11  -0.49
1. Consider Gaussian density models in different dimensions.


(a) Write a program to find the maximum likelihood values $\hat{\mu}$ and $\hat{\sigma}^2$. Apply your program individually to each of the three features $x_i$ of category $\omega_1$ in the table above.


(b) Modify your program to apply to two-dimensional Gaussian data $p(\mathbf{x}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Apply your data to each of the three possible pairings of two features for $\omega_1$.


(c) Modify your program to apply to three-dimensional Gaussian data. Apply your data to the full three-dimensional data for $\omega_1$.


(d) Assume your three-dimensional model is separable, so that $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2)$. Write a program to estimate the mean and the diagonal components of $\boldsymbol{\Sigma}$. Apply your program to the data in $\omega_2$.


(e) Compare your results for the mean of each feature $\mu_i$ calculated in the above ways. Explain why they are the same or different.


(f) Compare your results for the variance of each feature $\sigma_i^2$ calculated in the above ways. Explain why they are the same or different.
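
For part (a), a minimal starting point might look like the following. It assumes the usual maximum likelihood results for a univariate Gaussian (the sample mean, and the variance normalized by n rather than n − 1); the function name and the demonstration data are hypothetical and are not taken from the table.

```python
def gaussian_mle(values):
    """Maximum likelihood mean and variance of a one-dimensional Gaussian."""
    n = len(values)
    mu_hat = sum(values) / n
    # The ML variance divides by n, so it is biased for small n.
    var_hat = sum((v - mu_hat) ** 2 for v in values) / n
    return mu_hat, var_hat


demo = [1.0, 2.0, 3.0, 4.0]          # illustrative data only
mu_hat, var_hat = gaussian_mle(demo)
print(mu_hat, var_hat)
```

Applying the same function to each feature column separately gives the separable estimates of part (d), which is a useful reference point for the comparisons in parts (e) and (f).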
2. Consider a one-dimensional model of a triangular density governed by two scalar parameters:

$$p(x|\boldsymbol{\theta}) \equiv T(\mu, \delta) = \begin{cases} (\delta - |x - \mu|)/\delta^2 & \text{for } |x - \mu| < \delta \\ 0 & \text{otherwise,} \end{cases}$$

where $\boldsymbol{\theta} = (\mu, \delta)^t$. Write a program to calculate a density $p(x|\mathcal{D})$ via Bayesian methods (Eq. 26) and apply it to the $x_2$ feature of category $\omega_2$. Plot your resulting posterior density $p(x|\mathcal{D})$.
3. Consider Bayesian estimation of the mean of a one-dimensional Gaussian. Suppose you are given the prior for the mean is $p(\mu) \sim N(\mu_0, \sigma_0^2)$.

(a) Write a program that plots the density $p(x|\mathcal{D})$ given $\mu_0$, $\sigma_0$, $\sigma$ and training set $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$.


(b) Estimate $\sigma$ for the $x_2$ component of $\omega_3$ in the table above. Now assume $\mu_0 = -1$ and plot your estimated densities $p(x|\mathcal{D})$ for each of the following values of the dogmatism, $\sigma^2/\sigma_0^2$: 0.1, 1.0, 10, 100.
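
Before plotting, it may help to compute the parameters of the density being plotted. The sketch below assumes the standard results for a Gaussian with known variance and a Gaussian prior on its mean, written in terms of the dogmatism n0 = σ²/σ0²: the posterior mean is (n·x̄ + n0·μ0)/(n + n0), the posterior variance is σ²/(n + n0), and p(x|D) is Gaussian with that mean and variance σ² plus the posterior variance. The function name and sample values are hypothetical.

```python
def posterior_for_mean(samples, mu0, sigma, dogmatism):
    """Return (mu_n, var_n, predictive_var) for a Gaussian mean with a Gaussian prior.

    dogmatism is n0 = sigma**2 / sigma0**2, i.e. the prior counts as n0 samples.
    """
    n = len(samples)
    xbar = sum(samples) / n
    mu_n = (n * xbar + dogmatism * mu0) / (n + dogmatism)
    var_n = sigma ** 2 / (n + dogmatism)
    return mu_n, var_n, sigma ** 2 + var_n


samples = [0.0, 2.0]                  # illustrative data only
for n0 in (0.1, 1.0, 10.0, 100.0):
    mu_n, var_n, pred_var = posterior_for_mean(samples, -1.0, 1.0, n0)
    print(n0, round(mu_n, 4), round(var_n, 4), round(pred_var, 4))
```

As the dogmatism grows the posterior mean moves from the sample mean toward the prior mean, which is the trend the four plots in part (b) should display.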
4. Suppose we have reason to believe that our data is sampled from a two-dimensional uniform density

$$p(\mathbf{x}|\boldsymbol{\theta}) \sim U(\mathbf{x}_l, \mathbf{x}_u) = \begin{cases} \dfrac{1}{|x_{u1} - x_{l1}|\,|x_{u2} - x_{l2}|} & \text{for } x_{l1} \le x_1 \le x_{u1} \text{ and } x_{l2} \le x_2 \le x_{u2} \\ 0 & \text{otherwise,} \end{cases}$$

where $x_{l1}$ is the $x_1$ component of the “lower” bounding point $\mathbf{x}_l$, and analogously for the $x_2$ component and for the upper point. Suppose we have reliable prior information
that the density is zero outside the box defined by x; = (z6) and x, = E). Write 
a program that calculates $p(\mathbf{x}|\mathcal{D})$ via recursive Bayesian estimation and apply it to the $x_1$ and $x_2$ components of $\omega_1$, in sequence, from the table above. For each expanding data set $\mathcal{D}^n$ (2 < n < 10) plot your posterior density.

Section 3.6


5. Write a single program to calculate sufficient statistics for any members of the 
exponential family (Eq. 69). Assume that the x3 data from wz in the table come from 
an exponential density, and use your program to calculate the sufficient statistics for 
each of the following exponential forms: Gaussian, Rayleigh and Maxwell. 
6. Consider error rates in different dimensions.


(a) Use maximum likelihood to train a dichotomizer using the three-dimensional data for categories $\omega_1$ and $\omega_2$ in the table above. Numerically integrate to estimate the classification error rate.


(b) Now consider the data projected into a two-dimensional subspace. For each of the three subspaces, defined by $x_1 = 0$ or $x_2 = 0$ or $x_3 = 0$, train a Gaussian dichotomizer. Numerically integrate to estimate the error rate.


(c) Now consider the data projected onto one-dimensional subspaces, defined by each of the three axes. Train a Gaussian classifier, and numerically integrate to estimate the error rate.


(d) Discuss the rank order of the error rates you find.


(e) Assuming that you re-estimate the distribution in the different dimensions, must the Bayes error logically be higher in the projected spaces?
7. Repeat the steps in Exercise 6 but for categories $\omega_1$ and $\omega_3$.
8. Consider the classification of Gaussian data employing shrinkage of covariance matrices to a common one.


(a) Generate 20 training points from each of three equally probable three-dimensional Gaussian distributions $N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ with the following parameters:


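$$\boldsymbol{\mu}_1 = (0, 0, 0)^t, \quad \boldsymbol{\Sigma}_1 = \mathrm{diag}[3, 5, 2]$$

$$\boldsymbol{\mu}_2 = (1, 5, -3)^t, \quad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 4 & 1 \\ 0 & 1 & 6 \end{pmatrix}$$

$$\boldsymbol{\mu}_3 = (0, 0, 0)^t, \quad \boldsymbol{\Sigma}_3 = 10\,\mathbf{I}.$$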


(b) Write a program to estimate the means and covariances of your data. 


(c) Write a program that takes $\alpha$ and shrinks these estimated covariance matrices according to Eq. 76.


(d) Plot the training error as a function of $\alpha$, where $0 < \alpha < 1$.


(e) Use your program from part (a) to generate 50 test points from each category. Plot the test error as a function of $\alpha$.
9. Suppose we know that the ten data points in category $\omega_1$ in the table above come from a three-dimensional Gaussian. Suppose, however, that we do not have access to the $x_3$ components for the even-numbered data points.


(a) Write an EM program to estimate the mean and covariance of the distribution. Start your estimate with $\boldsymbol{\mu}^0 = \mathbf{0}$ and $\boldsymbol{\Sigma}^0 = \mathbf{I}$, the three-dimensional identity matrix.


(b) Compare your final estimate with that for the case when there is no missing data.


10. Suppose we know that the ten data points in category $\omega_2$ in the table above come from a three-dimensional uniform distribution $p(\mathbf{x}|\omega_2) \sim U(\mathbf{x}_l, \mathbf{x}_u)$. Suppose, however, that we do not have access to the $x_3$ components for the even-numbered data points.


(a) Write an EM program to estimate the six scalars comprising $\mathbf{x}_l$ and $\mathbf{x}_u$ of the distribution. Start your estimate with $\mathbf{x}_l = (-2, -2, -2)^t$ and $\mathbf{x}_u = (+2, +2, +2)^t$.


(b) Compare your final estimate with that for the case when there is no missing data.
Write a program to evaluate the Bayesian belief net for fish in Example 3, including the information in $P(x_i|a_j)$, $P(x_i|b_j)$, $P(c_i|x_j)$, and $P(d_i|x_j)$. Test your program on the calculation given in the Example. Apply your program to the following cases, and state any assumptions you need to make.

(a) A dark, thin fish is caught in the north Atlantic in summer. What is the probability it is a salmon?


(b) A thin, medium fish is caught in the north Atlantic. What is the probability it is winter? spring? summer? autumn?


(c) A light, wide fish is caught in the autumn. What is the probability it came from the north Atlantic?


Section 3.10


11. Consider the use of hidden Markov models for classifying sequences of four visible states, A-D. Train two hidden Markov models, each consisting of three hidden states (plus a null initial state and a null final state), fully connected, with the following data. Assume that each sequence starts with a null symbol and ends with an end null symbol (not listed).


sample | ω1         | ω2
   1   | AABBCCDD   | DDCCBBAA
   2   | ABBCBBDD   | DDABCBA
   3   | ACBCBCD    | CDCDCBABA
   4   | AD         | DDBBA
   5   | ACBCBABCDD | DADACBBAA
   6   | BABAADDD   | CDDCCBA
   7   | BABCDCC    | BDDBCAAAA
   8   | ABDBBCCDD  | BBABBDDDCD
   9   | ABAAACDCCD | DDADDBCAA
  10   | ABD        | DDCAAA


(a) Print out the full transition matrices for each of the models.


(b) Assume equal prior probabilities for the two models and classify each of the following sequences: ABBBCDDD, DADBCBAA, CDCBABA, and ADBBBCD.


(c) As above, classify the test pattern BADBDCBA. Find the prior probabilities for your two trained models that would lead to equal posteriors for your two categories when applied to this pattern.
Chapter 4


Nonparametric techniques


4.1 Introduction


In Chap. ?? we treated supervised learning under the assumption that the forms of the underlying density functions were known. Alas, in most pattern recognition applications this assumption is suspect; the common parametric forms rarely fit the densities actually encountered in practice. In particular, all of the classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities. Further, our hopes are rarely fulfilled that a high-dimensional density might be simply represented as the product of one-dimensional functions. In this chapter we shall examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.

There are several types of nonparametric methods of interest in pattern recognition. One consists of procedures for estimating the density functions $p(\mathbf{x}|\omega_j)$ from sample patterns. If these estimates are satisfactory, they can be substituted for the true densities when designing the classifier. Another consists of procedures for directly estimating the a posteriori probabilities $P(\omega_j|\mathbf{x})$. This is closely related to nonparametric design procedures such as the nearest-neighbor rule, which bypass probability estimation and go directly to decision functions. Finally, there are nonparametric procedures for transforming the feature space in the hope that it may be possible to employ parametric methods in the transformed space. These discriminant analysis methods include the Fisher linear discriminant, which provides an important link between the parametric techniques of Chap. ?? and the adaptive techniques of Chaps. ?? & ??.


4.2 Density estimation 
The basic ideas behind many of the methods of estimating an unknown probability density function are very simple, although rigorous demonstrations that the estimates converge require considerable care. The most fundamental techniques rely on the fact that the probability P that a vector $\mathbf{x}$ will fall in a region $\mathcal{R}$ is given by

$$P = \int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}'. \qquad (1)$$

Thus P is a smoothed or averaged version of the density function $p(\mathbf{x})$, and we can estimate this smoothed value of p by estimating the probability P. Suppose that n samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are drawn independently and identically distributed (i.i.d.) according to the probability law $p(\mathbf{x})$. Clearly, the probability that k of these n fall in $\mathcal{R}$ is given by the binomial law

$$P_k = \binom{n}{k} P^k (1-P)^{n-k}, \qquad (2)$$

and the expected value for k is

$$\mathcal{E}[k] = nP. \qquad (3)$$


Figure 4.1: The probability $P_k$ of finding k patterns in a volume where the space averaged probability is P as a function of k/n. Each curve is labelled by the total number of patterns n. For large n, such binomial distributions peak strongly at k/n = P (here chosen to be 0.7).

Moreover, this binomial distribution for k peaks very sharply about the mean, so that 
we expect that the ratio k/n will be a very good estimate for the probability P, and 
hence for the smoothed density function. This estimate is especially accurate when n 
is very large (Fig. 4.1). If we now assume that p(x) is continuous and that the region 
R is so small that p does not vary appreciably within it, we can write 


$$\int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}' \simeq p(\mathbf{x})V, \qquad (4)$$

where $\mathbf{x}$ is a point within $\mathcal{R}$ and V is the volume enclosed by $\mathcal{R}$. Combining Eqs. 1, 3 & 4, we arrive at the following obvious estimate for $p(\mathbf{x})$:

$$p(\mathbf{x}) \simeq \frac{k/n}{V}. \qquad (5)$$
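
A small simulation illustrates this estimate in one dimension, where the region is an interval and its volume is simply its length. The function name, the interval width, the sample size, and the choice of a standard normal density are all assumptions made for this sketch.

```python
import math
import random


def region_density_estimate(samples, x, width):
    """Estimate p(x) as (k/n)/V with the interval of length `width` centred on x."""
    k = sum(1 for s in samples if abs(s - x) <= width / 2)
    return (k / len(samples)) / width


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]

estimate = region_density_estimate(samples, 0.0, 0.2)
true_value = 1.0 / math.sqrt(2.0 * math.pi)   # standard normal density at 0
print(estimate, true_value)
```

With a fixed width the estimate settles on the average of the density over the interval rather than on its value at the center, which is the limitation discussed next.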
There are several problems that remain — some practical and some theoretical. 
If we fix the volume V and take more and more training samples, the ratio k/n will 
converge (in probability) as desired, but we have only obtained an estimate of the 
space-averaged value of p(x), 


$$\frac{P}{V} = \frac{\int_{\mathcal{R}} p(\mathbf{x}')\, d\mathbf{x}'}{\int_{\mathcal{R}} d\mathbf{x}'}. \qquad (6)$$


If we want to obtain $p(\mathbf{x})$ rather than just an averaged version of it, we must be prepared to let V approach zero. However, if we fix the number n of samples and let V approach zero, the region will eventually become so small that it will enclose no samples, and our estimate $p(\mathbf{x}) \simeq 0$ will be useless. Or if by chance one or more of the training samples coincide at $\mathbf{x}$, the estimate diverges to infinity, which is equally useless.

From a practical standpoint, we note that the number of samples is always limited. 
Thus, the volume V can not be allowed to become arbitrarily small. If this kind of 
estimate is to be used, one will have to accept a certain amount of variance in the 
ratio k/n and a certain amount of averaging of the density p(x). 

From a theoretical standpoint, it is interesting to ask how these limitations can 
be circumvented if an unlimited number of samples is available. Suppose we use the 
following procedure. To estimate the density at x, we form a sequence of regions 
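$\mathcal{R}_1, \mathcal{R}_2, \ldots$, containing $\mathbf{x}$: the first region to be used with one sample, the second with two, and so on. Let $V_n$ be the volume of $\mathcal{R}_n$, $k_n$ be the number of samples falling in $\mathcal{R}_n$, and $p_n(\mathbf{x})$ be the nth estimate for $p(\mathbf{x})$:

$$p_n(\mathbf{x}) = \frac{k_n/n}{V_n}. \qquad (7)$$

If $p_n(\mathbf{x})$ is to converge to $p(\mathbf{x})$, three conditions appear to be required:

• $\lim_{n \to \infty} V_n = 0$

• $\lim_{n \to \infty} k_n = \infty$

• $\lim_{n \to \infty} k_n/n = 0$.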


The first condition assures us that the space averaged P/V will converge to $p(\mathbf{x})$, provided that the regions shrink uniformly and that $p(\cdot)$ is continuous at $\mathbf{x}$. The second condition, which only makes sense if $p(\mathbf{x}) \neq 0$, assures us that the frequency ratio will converge (in probability) to the probability P. The third condition is clearly necessary if $p_n(\mathbf{x})$ given by Eq. 7 is to converge at all. It also says that although a huge number of samples will eventually fall within the small region $\mathcal{R}_n$, they will form a negligibly small fraction of the total number of samples.

There are two common ways of obtaining sequences of regions that satisfy these conditions (Fig. 4.2). One is to shrink an initial region by specifying the volume $V_n$ as some function of n, such as $V_n = 1/\sqrt{n}$. It then must be shown that the random variables $k_n$ and $k_n/n$ behave properly, or more to the point, that $p_n(\mathbf{x})$ converges to $p(\mathbf{x})$. This is basically the Parzen-window method that will be examined in Sect. 4.3. The second method is to specify $k_n$ as some function of n, such as $k_n = \sqrt{n}$. Here the volume $V_n$ is grown until it encloses $k_n$ neighbors of $\mathbf{x}$. This is the $k_n$-nearest-neighbor estimation method. Both of these methods do in fact converge, although it is difficult to make meaningful statements about their finite-sample behavior.


Figure 4.2: Two methods for estimating the density at a point x (at the center of 
each square) are to xxx. 


4.3 Parzen Windows 


The Parzen-window approach to estimating densities can be introduced by temporarily assuming that the region $\mathcal{R}_n$ is a d-dimensional hypercube. If $h_n$ is the length of an edge of that hypercube, then its volume is given by

$$V_n = h_n^d. \qquad (8)$$

We can obtain an analytic expression for $k_n$, the number of samples falling in the hypercube, by defining the following window function:

$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \quad j = 1, \ldots, d \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$

Thus, $\varphi(\mathbf{u})$ defines a unit hypercube centered at the origin. It follows that $\varphi((\mathbf{x} - \mathbf{x}_i)/h_n)$ is equal to unity if $\mathbf{x}_i$ falls within the hypercube of volume $V_n$ centered at $\mathbf{x}$, and is zero otherwise. The number of samples in this hypercube is therefore given by

$$k_n = \sum_{i=1}^{n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right), \qquad (10)$$

and when we substitute this into Eq. 7 we obtain the estimate

$$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right). \qquad (11)$$
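
Equations 8 through 11 translate almost line for line into code. The following is a minimal sketch with hypothetical function names and made-up two-dimensional samples, intended only to show how the window function, the volume, and the average fit together.

```python
def hypercube_window(u):
    """Eq. 9: 1 inside the unit hypercube centred at the origin, else 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0


def parzen_estimate(x, samples, h, window=hypercube_window):
    """Eq. 11: average of window values scaled by 1/V_n, with V_n = h**d."""
    d = len(x)
    volume = h ** d
    total = sum(window([(xj - sj) / h for xj, sj in zip(x, s)])
                for s in samples)
    return total / (len(samples) * volume)


# Illustrative samples: three lie within the unit square centred on the origin.
samples = [[0.0, 0.0], [0.4, 0.4], [2.0, 2.0], [-0.3, 0.1]]
print(parzen_estimate([0.0, 0.0], samples, 1.0))   # 3 of 4 samples, V = 1
print(parzen_estimate([0.0, 0.0], samples, 2.0))   # 3 of 4 samples, V = 4
```

Passing a different `window` argument gives the generalization to other window functions without changing the estimator itself.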


This equation suggests a more general approach to estimating density functions. Rather than limiting ourselves to the hypercube window function of Eq. 9, suppose we allow a more general class of window functions. In such a case, Eq. 11 expresses our estimate for $p(\mathbf{x})$ as an average of functions of $\mathbf{x}$ and the samples $\mathbf{x}_i$. In essence, the window function is being used for interpolation, with each sample contributing to the estimate in accordance with its distance from $\mathbf{x}$.

It is natural to ask that the estimate $p_n(\mathbf{x})$ be a legitimate density function, i.e., that it be nonnegative and integrate to one. This can be assured by requiring the window function itself be a density function. To be more precise, if we require that

$$\varphi(\mathbf{x}) \ge 0 \qquad (12)$$

and

$$\int \varphi(\mathbf{u})\, d\mathbf{u} = 1, \qquad (13)$$

and if we maintain the relation $V_n = h_n^d$, then it follows at once that $p_n(\mathbf{x})$ also satisfies these conditions.


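Let us examine the effect that the window width $h_n$ has on $p_n(\mathbf{x})$. If we define the function $\delta_n(\mathbf{x})$ by

$$\delta_n(\mathbf{x}) = \frac{1}{V_n}\, \varphi\left(\frac{\mathbf{x}}{h_n}\right), \qquad (14)$$

then we can write $p_n(\mathbf{x})$ as the average

$$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x} - \mathbf{x}_i). \qquad (15)$$

Since $V_n = h_n^d$, $h_n$ clearly affects both the amplitude and the width of $\delta_n(\mathbf{x})$ (Fig. 4.3). If $h_n$ is very large, the amplitude of $\delta_n$ is small, and $\mathbf{x}$ must be far from $\mathbf{x}_i$ before $\delta_n(\mathbf{x} - \mathbf{x}_i)$ changes much from $\delta_n(\mathbf{0})$. In this case, $p_n(\mathbf{x})$ is the superposition of n broad, slowly changing functions and is a very smooth “out-of-focus” estimate of $p(\mathbf{x})$. On the other hand, if $h_n$ is very small, the peak value of $\delta_n(\mathbf{x} - \mathbf{x}_i)$ is large and occurs near $\mathbf{x} = \mathbf{x}_i$. In this case $p_n(\mathbf{x})$ is the superposition of n sharp pulses centered at the samples: an erratic, “noisy” estimate (Fig. 4.4). For any value of $h_n$, the distribution is normalized, i.e.,

$$\int \delta_n(\mathbf{x} - \mathbf{x}_i)\, d\mathbf{x} = \int \frac{1}{V_n}\, \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) d\mathbf{x} = \int \varphi(\mathbf{u})\, d\mathbf{u} = 1. \qquad (16)$$

Thus, as $h_n$ approaches zero, $\delta_n(\mathbf{x} - \mathbf{x}_i)$ approaches a Dirac delta function centered at $\mathbf{x}_i$, and $p_n(\mathbf{x})$ approaches a superposition of delta functions centered at the samples.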

Clearly, the choice of $h_n$ (or $V_n$) has an important effect on $p_n(\mathbf{x})$. If $V_n$ is too
large, the estimate will suffer from too little resolution; if $V_n$ is too small, the estimate
will suffer from too much statistical variability. With a limited number of samples, the
best we can do is to seek some acceptable compromise. However, with an unlimited
number of samples, it is possible to let $V_n$ slowly approach zero as $n$ increases and
have $p_n(\mathbf{x})$ converge to the unknown density $p(\mathbf{x})$.

In discussing convergence, we must recognize that we are talking about the convergence
of a sequence of random variables, since for any fixed $\mathbf{x}$ the value of $p_n(\mathbf{x})$
depends on the random samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Thus, $p_n(\mathbf{x})$ has some mean $\bar{p}_n(\mathbf{x})$ and
variance $\sigma_n^2(\mathbf{x})$. We shall say that the estimate $p_n(\mathbf{x})$ converges to $p(\mathbf{x})$ if

$$\lim_{n \to \infty} \bar{p}_n(\mathbf{x}) = p(\mathbf{x}) \tag{17}$$

and

$$\lim_{n \to \infty} \sigma_n^2(\mathbf{x}) = 0. \tag{18}$$

To prove convergence we must place conditions on the unknown density $p(\mathbf{x})$, on
the window function $\varphi(\mathbf{u})$, and on the window width $h_n$. In general, continuity of
$p(\cdot)$ at $\mathbf{x}$ is required, and the conditions imposed by Eqs. 12 & 13 are customarily
invoked. With care, it can be shown that the following additional conditions assure
convergence (Problem 1):

$$\sup_{\mathbf{u}} \varphi(\mathbf{u}) < \infty \tag{19}$$

$$\lim_{\|\mathbf{u}\| \to \infty} \varphi(\mathbf{u}) \prod_{i=1}^{d} u_i = 0 \tag{20}$$

$$\lim_{n \to \infty} V_n = 0 \tag{21}$$

and

$$\lim_{n \to \infty} n V_n = \infty. \tag{22}$$

Equations 19 & 20 keep $\varphi(\cdot)$ well behaved, and are satisfied by most density functions
that one might think of using for window functions. Equations 21 & 22 state that the
volume $V_n$ must approach zero, but at a rate slower than $1/n$. We shall now see why
these are the basic conditions for convergence.

Figure 4.3: Examples of two-dimensional circularly symmetric normal Parzen windows
$\varphi(\mathbf{x}/h)$ for three different values of $h$. Note that because the $\delta_n(\cdot)$ are normalized,
different vertical scales must be used to show their structure.

Figure 4.4: Three Parzen-window density estimates based on the same set of five
samples, using the window functions in Fig. 4.3. As before, the vertical axes have
been scaled to show the structure of each function.


4.3.1 Convergence of the Mean 


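Consider first $\bar{p}_n(\mathbf{x})$, the mean of $p_n(\mathbf{x})$. Since the samples $\mathbf{x}_i$ are i.i.d. according to
the (unknown) density $p(\mathbf{x})$, we have

$$\begin{aligned}
\bar{p}_n(\mathbf{x}) &= \mathcal{E}[p_n(\mathbf{x})] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathcal{E}\left[\frac{1}{V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)\right] \\
&= \int \frac{1}{V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{v}}{h_n}\right) p(\mathbf{v}) \, d\mathbf{v} \\
&= \int \delta_n(\mathbf{x} - \mathbf{v}) \, p(\mathbf{v}) \, d\mathbf{v}.
\end{aligned} \tag{23}$$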

This equation shows that the expected value of the estimate is an averaged value
of the unknown density, namely a convolution of the unknown density and the window
function (Appendix ??). Thus, $\bar{p}_n(\mathbf{x})$ is a blurred version of $p(\mathbf{x})$ as seen through the
averaging window. But as $V_n$ approaches zero, $\delta_n(\mathbf{x} - \mathbf{v})$ approaches a delta function
centered at $\mathbf{x}$. Thus, if $p$ is continuous at $\mathbf{x}$, Eq. 21 ensures that $\bar{p}_n(\mathbf{x})$ will approach
$p(\mathbf{x})$ as $n$ approaches infinity.
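
As a worked illustration of this blurring (an added example, not part of the original derivation), suppose the unknown density and the window are both univariate standard normals, so that $\delta_n$ is a normal density with standard deviation $h_n$. The convolution of two zero-mean normal densities is again normal, with the variances added, so Eq. 23 gives

$$\bar{p}_n(x) = \int \frac{1}{\sqrt{2\pi}\, h_n} e^{-(x - v)^2 / 2h_n^2} \, \frac{1}{\sqrt{2\pi}} e^{-v^2/2} \, dv = \frac{1}{\sqrt{2\pi (1 + h_n^2)}} \, e^{-x^2 / 2(1 + h_n^2)}.$$

The mean of the estimate is therefore a normal density of variance $1 + h_n^2$, slightly broader than the true one, and it approaches the true density as $h_n \to 0$.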


4.3.2 Convergence of the Variance 


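Equation 23 shows that there is no need for an infinite number of samples to make
$\bar{p}_n(\mathbf{x})$ approach $p(\mathbf{x})$; one can achieve this for any $n$ merely by letting $V_n$ approach
zero. Of course, for a particular set of $n$ samples the resulting “spiky” estimate is
useless; this fact highlights the need for us to consider the variance of the estimate.
Since $p_n(\mathbf{x})$ is the sum of functions of statistically independent random variables, its
variance is the sum of the variances of the separate terms, and hence

$$\begin{aligned}
\sigma_n^2(\mathbf{x}) &= \sum_{i=1}^{n} \mathcal{E}\left[\left(\frac{1}{n V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) - \frac{1}{n} \bar{p}_n(\mathbf{x})\right)^2\right] \\
&= n \, \mathcal{E}\left[\frac{1}{n^2 V_n^2} \varphi^2\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)\right] - \frac{1}{n} \bar{p}_n^2(\mathbf{x}) \\
&= \frac{1}{n V_n} \int \frac{1}{V_n} \varphi^2\left(\frac{\mathbf{x} - \mathbf{v}}{h_n}\right) p(\mathbf{v}) \, d\mathbf{v} - \frac{1}{n} \bar{p}_n^2(\mathbf{x}).
\end{aligned} \tag{24}$$

By dropping the second term, bounding $\varphi(\cdot)$ and using Eq. 23, we obtain

$$\sigma_n^2(\mathbf{x}) \le \frac{\sup(\varphi(\cdot)) \, \bar{p}_n(\mathbf{x})}{n V_n}. \tag{25}$$

Clearly, to obtain a small variance we want a large value for $V_n$, not a small one:
a large $V_n$ smooths out the local variations in density. However, since the numerator
stays finite as $n$ approaches infinity, we can let $V_n$ approach zero and still obtain zero
variance, provided that $n V_n$ approaches infinity. For example, we can let $V_n = V_1/\sqrt{n}$
or $V_1/\ln n$ or any other function satisfying Eqs. 21 & 22.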
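
The two example schedules can be tabulated to see both conditions at work. The sketch below is illustrative only: the function names and the default $V_1 = 1$ are assumptions, and a finite table can suggest, but of course not prove, the limits in Eqs. 21 & 22.

```python
import math


def volume_sqrt(n, v1=1.0):
    """V_n = V_1 / sqrt(n)."""
    return v1 / math.sqrt(n)


def volume_log(n, v1=1.0):
    """V_n = V_1 / ln(n); only meaningful for n >= 2."""
    return v1 / math.log(n)


# V_n should shrink (Eq. 21) while n * V_n keeps growing (Eq. 22).
for n in (10, 1000, 100000):
    for name, schedule in (("V1/sqrt(n)", volume_sqrt), ("V1/ln(n)", volume_log)):
        v = schedule(n)
        print(f"n={n:>6}  {name:<11} V_n={v:.5f}  n*V_n={n * v:.1f}")
```

Since the bound of Eq. 25 is proportional to $1/(nV_n)$, the last column indicates how quickly the variance bound shrinks under each schedule.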

This is the principal theoretical result. Unfortunately, it does not tell us how to
choose $\varphi(\cdot)$ and $V_n$ to obtain good results in the finite sample case. Indeed, unless we
have more knowledge about $p(\mathbf{x})$ than the mere fact that it is continuous, we have no
direct basis for optimizing finite sample results.


4.3.3 Illustrations 


It is interesting to see how the Parzen window method behaves on some simple examples,
and particularly the effect of the window function. Consider first the case
where $p(x)$ is a zero-mean, unit-variance, univariate normal density. Let the window
function be of the same form:

$$\varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}. \tag{26}$$

Finally, let $h_n = h_1/\sqrt{n}$, where $h_1$ is a parameter at our disposal. Thus $p_n(x)$ is an
average of normal densities centered at the samples:

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \varphi\left(\frac{x - x_i}{h_n}\right). \tag{27}$$

While it is not hard to evaluate Eqs. 23 & 24 to find the mean and variance of
$p_n(x)$, it is even more interesting to see numerical results. When a particular set of
normally distributed random samples was generated and used to compute $p_n(x)$, the
results shown in Fig. 4.5 were obtained. These results depend both on $n$ and $h_1$. For
$n = 1$, $p_n(x)$ is merely a single Gaussian centered about the first sample, which of
course has neither the mean nor the variance of the true distribution. For $n = 10$
and $h_1 = 0.1$ the contributions of the individual samples are clearly discernible; this
is not the case for $h_1 = 1$ and $h_1 = 0.5$. As $n$ gets larger, the ability of $p_n(x)$ to resolve
variations in $p(x)$ increases. Concomitantly, $p_n(x)$ appears to be more sensitive to
local sampling irregularities when $n$ is large, although we are assured that $p_n(x)$ will
converge to the smooth normal curve as $n$ goes to infinity. While one should not judge
on visual appearance alone, it is clear that many samples are required to obtain an
accurate estimate. Figure 4.6 shows analogous results in two dimensions.
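
A numerical experiment of this kind is easy to reproduce in outline. The following sketch implements Eqs. 26 & 27 with $h_n = h_1/\sqrt{n}$; the function names, the seed, and the sample size of 100 are arbitrary assumptions, so the printed numbers will not match the book's figure.

```python
import math
import random


def gaussian_window(u):
    """Eq. 26: zero-mean, unit-variance normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_gaussian(x, samples, h1):
    """Eq. 27 with window width h_n = h1 / sqrt(n)."""
    n = len(samples)
    hn = h1 / math.sqrt(n)
    return sum(gaussian_window((x - xi) / hn) for xi in samples) / (n * hn)


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
samples = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Estimate the density at x = 0 for three values of h1.
for h1 in (1.0, 0.5, 0.1):
    print(h1, parzen_gaussian(0.0, samples, h1))
```

With a single sample, `parzen_gaussian` reduces to one Gaussian centered on that sample, which matches the $n = 1$ case described above.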

As a second one-dimensional example, we let $\varphi(x)$ and $h_n$ be the same as in
Fig. 4.5, but let the unknown density be a mixture of two uniform densities:

$$p(x) = \begin{cases} 1 & -2.5 < x < -2 \\ 1/4 & 0 < x < 2 \\ 0 & \text{otherwise.} \end{cases} \tag{28}$$

Figure 4.7 shows the behavior of Parzen-window estimates for this density. As before,
the case $n = 1$ tells more about the window function than it tells about the unknown
density. For $n = 16$, none of the estimates is particularly good, but results for $n = 256$
and $h_1 = 1$ are beginning to appear acceptable.

Figure 4.5: Parzen-window estimates of a univariate normal density using different
window widths (columns: $h = 1$, $h = 0.5$, $h = 0.1$) and numbers of samples (rows,
beginning with $n = 1$). The vertical axes have been scaled to best
show the structure in each graph. Note particularly that the $n = \infty$ estimates are the
same (and match the true generating function), regardless of window width $h$.

Figure 4.7: Parzen-window estimates of a bimodal distribution using different window
widths and numbers of samples. Note particularly that the $n = \infty$ estimates are the
same (and match the true generating distribution), regardless of window width $h$.


4.3.4 Classification example 


In classifiers based on Parzen-window estimation, we estimate the densities for each
category and classify a test point by the label corresponding to the maximum posterior.
If there are multiple categories with unequal priors we can easily include these
too (Problem 4). The decision regions for a Parzen-window classifier depend upon
the choice of window function, of course, as illustrated in Fig. 4.8. In general, the
training error (the empirical error on the training points themselves) can be made
arbitrarily low by making the window width sufficiently small.* However, the goal of
creating a classifier is to classify novel patterns, and alas a low training error does
not guarantee a small test error, as we shall explore in Chap. ??. Although a generic
Gaussian window shape can be justified by considerations of noise, statistical independence
and uncertainty, in the absence of other information about the underlying
distributions there is little theoretical justification of one window width over another.
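
A one-dimensional Parzen-window classifier along these lines might look as follows. This is a sketch under stated assumptions: the Gaussian window, the function names, the feature values and the two labels are hypothetical, and priors default to the class frequencies in the training set.

```python
import math


def gaussian_window(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def class_density(x, samples, h):
    """Parzen estimate of p(x | class) from that class's samples."""
    return sum(gaussian_window((x - xi) / h) for xi in samples) / (len(samples) * h)


def parzen_classify(x, training, h, priors=None):
    """Pick the label maximizing prior * estimated class-conditional density.

    `training` maps each label to a list of one-dimensional samples.
    """
    total = sum(len(s) for s in training.values())
    best_label, best_score = None, -1.0
    for label, samples in training.items():
        prior = priors[label] if priors else len(samples) / total
        score = prior * class_density(x, samples, h)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical one-dimensional feature values for two categories.
training = {"salmon": [1.0, 1.2, 1.5, 0.8], "sea bass": [3.0, 3.3, 2.7, 3.6]}
print(parzen_classify(1.1, training, h=0.5))
print(parzen_classify(3.1, training, h=0.5))
```

With a very small `h`, every training point is dominated by its own window, which is the sense in which the training error can be driven down without any guarantee about novel patterns.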

These density estimation and classification examples illustrate some of the power
and some of the limitations of nonparametric methods. Their power resides in their
generality. Exactly the same procedure was used for the unimodal normal case and
the bimodal mixture case and we did not need to make any assumptions about the
distributions ahead of time. With enough samples, we are essentially assured of
convergence to an arbitrarily complicated target density. On the other hand, the
number of samples needed may be very large indeed, much greater than would be
required if we knew the form of the unknown density. Little or nothing in the way of
data reduction is provided, which leads to severe requirements for computation time
and storage. Moreover, the demand for a large number of samples grows exponentially
with the dimensionality of the feature space. This limitation is related to the “curse of
dimensionality,” and severely restricts the practical application of such nonparametric
procedures (Problem 11). The fundamental reason for the curse of dimensionality is
that high-dimensional functions have the potential to be much more complicated than
low-dimensional ones, and that those complications are harder to discern. The only
way to beat the curse is to incorporate knowledge about the data that is correct.

* We ignore cases in which the same feature vector has been assigned to multiple categories.

Figure 4.8: The decision boundaries in a two-dimensional Parzen-window dichotomizer
depend on the window width $h$. At the left a small $h$ leads to boundaries
that are more complicated than for large $h$ on the same data set, shown at the right.
Apparently, for this data a small $h$ would be appropriate for the upper region, while
a large $h$ for the lower region; no single window width is ideal overall.


4.3.5 Probabilistic Neural Networks (PNNs) 


A hardware implementation of the Parzen windows approach is found in Probabilistic
Neural Networks (Fig. 4.9). Suppose we wish to form a Parzen estimate based on $n$
patterns, each of which is $d$-dimensional, randomly sampled from $c$ classes. The PNN
for this case consists of $d$ input units comprising the input layer; each input unit is
connected to each of the $n$ pattern units; each pattern unit is, in turn, connected to one and
only one of the $c$ category units. The connections from the input to pattern units
represent modifiable weights, which will be trained. (While these weights are merely
parameters and could be represented by a vector $\boldsymbol{\theta}$, in keeping with the established
terminology in neural networks we shall use the symbol $\mathbf{w}$.) Each link from a pattern
unit to its associated category unit is of a single constant magnitude.


Figure 4.9: A probabilistic neural network (PNN) consists of $d$ input units, $n$ pattern
units and $c$ category units. Each pattern unit forms the inner product of its weight
vector and the normalized pattern vector $\mathbf{x}$ to form $z = \mathbf{w}^t \mathbf{x}$, and then emits
$\exp[(z - 1)/\sigma^2]$. Each category unit sums such contributions from the pattern units connected
to it. This ensures that the activity in each of the category units represents the Parzen-window
density estimate using a circularly symmetric Gaussian window of covariance
$\sigma^2 \mathbf{I}$, where $\mathbf{I}$ is the $d \times d$ identity matrix.

The PNN is trained in the following way. First, each pattern $\mathbf{x}$ of the training set is
normalized to have unit length, i.e., is scaled so that $\sum_{i=1}^{d} x_i^2 = 1$.* The first normalized
training pattern is placed on the input units. The modifiable weights linking the input
units and the first pattern unit are set such that $\mathbf{w}_1 = \mathbf{x}_1$. (Note that because of the
normalization of $\mathbf{x}_1$, $\mathbf{w}_1$ is normalized too.) Then, a single connection from the first
pattern unit is made to the category unit corresponding to the known class of that
pattern. The process is repeated with each of the remaining training patterns, setting
the weights to the successive pattern units such that $\mathbf{w}_k = \mathbf{x}_k$ for $k = 1, 2, \ldots, n$.
After such training we have a network that is fully connected between input and
pattern units, and sparsely connected from pattern to category units. If we denote
the components of the $j$th pattern as $x_{jk}$ and the weights to the $j$th pattern unit $w_{jk}$,
for $j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, d$, then our algorithm is:


Algorithm 1 (PNN training)

    begin initialize $j = 0$, $n$ = #patterns
        do $j \leftarrow j + 1$
            normalize: $x_{jk} \leftarrow x_{jk} \Big/ \left(\sum_{i=1}^{d} x_{ji}^2\right)^{1/2}$
            train: $w_{jk} \leftarrow x_{jk}$
            if $\mathbf{x}_j \in \omega_i$ then $a_{ji} \leftarrow 1$
        until $j = n$
    end

* Such normalization collapses two vectors having the same direction but different magnitude. In
order to avoid this, we can augment the pattern with a feature of magnitude 1.0, making it
$(d+1)$-dimensional, and then normalize.


The trained network is then used for classification in the following way. A normalized
test pattern $\mathbf{x}$ is placed at the input units. Each pattern unit computes the
inner product

$$z_k = \mathbf{w}_k^t \mathbf{x}, \tag{29}$$

and emits a nonlinear function of $z_k$; each output unit sums the contributions from
all pattern units connected to it. The nonlinear function is $e^{(z_k - 1)/\sigma^2}$, where $\sigma$ is a
parameter set by the user and is equal to $\sqrt{2}$ times the width of the effective Gaussian
window. To understand this choice of nonlinearity, consider an (unnormalized) Gaussian
window centered on the position of one of the training patterns $\mathbf{w}_k$. We work
backwards from the desired Gaussian window function to infer the nonlinear transfer
function that should be employed by the pattern units. That is, if we let our effective
width $h_n$ be a constant, the window function is

$$\underbrace{\varphi\left(\frac{\mathbf{x} - \mathbf{w}_k}{h_n}\right)}_{\text{desired Gaussian}} \propto e^{-(\mathbf{x} - \mathbf{w}_k)^t (\mathbf{x} - \mathbf{w}_k)/2\sigma^2} = e^{-(\mathbf{x}^t \mathbf{x} + \mathbf{w}_k^t \mathbf{w}_k - 2 \mathbf{x}^t \mathbf{w}_k)/2\sigma^2} = \underbrace{e^{(z_k - 1)/\sigma^2}}_{\text{transfer function}}, \tag{30}$$

where we have used our normalization conditions $\mathbf{x}^t \mathbf{x} = \mathbf{w}_k^t \mathbf{w}_k = 1$. Thus each pattern
unit contributes to its associated category unit a signal equal to the probability the
test point was generated by a Gaussian centered on the associated training point.
The sum of these local estimates (computed at the corresponding category unit) gives
the discriminant function $g_i(\mathbf{x})$, the Parzen window estimate of the underlying
distribution. The $\max_i g_i(\mathbf{x})$ operation gives the desired category for the test point
(Algorithm 2).
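
Algorithm 2 (PNN classification)

    begin initialize $k = 0$, $\mathbf{x}$ = test pattern
        do $k \leftarrow k + 1$
            $z_k \leftarrow \mathbf{w}_k^t \mathbf{x}$
            if $a_{kc} = 1$ then $g_c \leftarrow g_c + \exp[(z_k - 1)/\sigma^2]$
        until $k = n$
        return class $\leftarrow \arg\max_i g_i(\mathbf{x})$
    end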
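
The two PNN algorithms translate almost line for line into software. The sketch below is a sequential Python rendering, not the parallel hardware of Fig. 4.9; the function names, the tiny data set, the labels `"w1"`/`"w2"` and the value $\sigma = 0.5$ are all assumptions made for illustration, and the pattern-to-category connections are represented simply by a list of labels, one per pattern unit.

```python
import math


def normalize(x):
    """Scale x to unit length (undefined for the zero vector)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def pnn_train(patterns, labels):
    """PNN training: one pattern unit per sample, weights = normalized pattern."""
    weights = [normalize(x) for x in patterns]
    return weights, list(labels)


def pnn_classify(x, weights, labels, sigma):
    """PNN classification: add exp[(z_k - 1) / sigma^2] to the category of unit k."""
    x = normalize(x)
    g = {}
    for w, label in zip(weights, labels):
        z = sum(wi * xi for wi, xi in zip(w, x))  # Eq. 29
        g[label] = g.get(label, 0.0) + math.exp((z - 1.0) / sigma ** 2)
    return max(g, key=g.get), g


# Hypothetical two-dimensional training patterns from two categories.
patterns = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
labels = ["w1", "w1", "w2", "w2"]
weights, unit_labels = pnn_train(patterns, labels)
winner, activations = pnn_classify((0.8, 0.3), weights, unit_labels, sigma=0.5)
print(winner, activations)
```

Training is a single pass that stores the normalized patterns, which is why adding a new training pattern later amounts to appending one more weight vector and label.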


One of the benefits of PNNs is their speed of learning, since the learning rule
(i.e., setting $\mathbf{w}_k = \mathbf{x}_k$) is simple and requires only a single pass through the training
data. The space complexity (amount of memory) for the PNN is easy to determine by
counting the number of wires in Fig. 4.9: $O((n+1)d)$. This can be quite severe, for
instance in a hardware application, since both $n$ and $d$ can be quite large. The time
complexity for classification by the parallel implementation of Fig. 4.9 is $O(1)$, since
the $n$ inner products of Eq. 29 can be done in parallel. Thus this PNN architecture
could find uses where recognition speed is important and storage is not a severe
limitation. Another benefit is that new training patterns can be incorporated into
a previously trained classifier quite easily; this might be important for a particular
on-line application.


4.3.6 Choosing the window function 


As we have seen, one of the problems encountered in the Parzen-window/PNN approach
concerns the choice of the sequence of cell volumes $V_1, V_2, \ldots$ or overall
window size (or indeed other window parameters, such as shape or orientation). For
example, if we take $V_n = V_1/\sqrt{n}$, the results for any finite $n$ will be very sensitive to
the choice for the initial volume $V_1$. If $V_1$ is too small, most of the volumes will be
empty, and the estimate $p_n(\mathbf{x})$ will be very erratic (Fig. 4.7). On the other hand, if
$V_1$ is too large, important spatial variations in $p(\mathbf{x})$ may be lost due to averaging over
the cell volume. Furthermore, it may well be the case that a cell volume appropriate
for one region of the feature space might be entirely unsuitable in a different region
(Fig. 4.8). In Chap. ?? we shall consider general methods, including cross-validation,
which are often used in conjunction with Parzen windows. Now, though, we turn to an
important alternative method that is both useful and has solvable analytic properties.


4.4 $k_n$-Nearest-Neighbor Estimation


A potential remedy for the problem of the unknown “best” window function is to
let the cell volume be a function of the training data, rather than some arbitrary
function of the overall number of samples. For example, to estimate $p(\mathbf{x})$ from $n$
training samples or prototypes we can center a cell about $\mathbf{x}$ and let it grow until it
captures $k_n$ samples, where $k_n$ is some specified function of $n$. These samples are
the $k_n$ nearest-neighbors of $\mathbf{x}$. If the density is high near $\mathbf{x}$, the cell will be relatively
small, which leads to good resolution. If the density is low, it is true that the cell will
grow large, but it will stop soon after it enters regions of higher density. In either
case, if we take

$$p_n(\mathbf{x}) = \frac{k_n/n}{V_n} \tag{31}$$

we want $k_n$ to go to infinity as $n$ goes to infinity, since this assures us that $k_n/n$
will be a good estimate of the probability that a point will fall in the cell of volume
$V_n$. However, we also want $k_n$ to grow sufficiently slowly that the size of the cell
needed to capture $k_n$ training samples will shrink to zero. Thus, it is clear from
Eq. 31 that the ratio $k_n/n$ must go to zero. Although we shall not supply a proof,
it can be shown that the conditions $\lim_{n \to \infty} k_n = \infty$ and $\lim_{n \to \infty} k_n/n = 0$ are necessary
and sufficient for $p_n(\mathbf{x})$ to converge to $p(\mathbf{x})$ in probability at all points where $p(\mathbf{x})$ is
continuous (Problem 5). If we take $k_n = \sqrt{n}$ and assume that $p_n(\mathbf{x})$ is a reasonably
good approximation to $p(\mathbf{x})$ we then see from Eq. 31 that $V_n \simeq 1/(\sqrt{n}\, p(\mathbf{x}))$. Thus,
$V_n$ again has the form $V_1/\sqrt{n}$, but the initial volume $V_1$ is determined by the nature
of the data rather than by some arbitrary choice on our part. Note that there are
nearly always discontinuities in the slopes of these estimates, and these lie away from
the prototypes themselves (Figs. 4.10 & 4.11).

Figure 4.10: Eight points in one dimension and the $k$-nearest-neighbor density estimates,
for $k = 3$ and 5. Note especially that the discontinuities in the slopes in the
estimates generally occur away from the positions of the points themselves.

Figure 4.11: The $k$-nearest-neighbor estimate of a two-dimensional density for $k = 5$.
Notice how such a finite $n$ estimate can be quite “jagged,” and that discontinuities in
the slopes generally occur along lines away from the positions of the points themselves.

Figure 4.12: Several $k$-nearest-neighbor estimates of two unidimensional densities: a
Gaussian and a bimodal distribution. Notice how the finite $n$ estimates can be quite
“spiky.”

It is instructive to compare the performance of this method with that of the
Parzen-window/PNN method on the data used in the previous examples. With $n = 1$
and $k_n = \sqrt{n} = 1$, the estimate becomes

$$p_n(x) = \frac{1}{2|x - x_1|}. \tag{32}$$

This is clearly a poor estimate of $p(x)$, with its integral embarrassing us by diverging
to infinity. As shown in Fig. 4.12, the estimate becomes considerably better as $n$ gets
larger, even though the integral of the estimate remains infinite. This unfortunate fact
is compensated by the fact that $p_n(x)$ never plunges to zero just because no samples
fall within some arbitrary cell or window. While this might seem to be a meager
compensation, it can be of considerable value in higher-dimensional spaces.
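
In one dimension the $k_n$-nearest-neighbor estimate of Eq. 31 takes only a few lines. In this sketch the cell is assumed to be the interval centered at $x$ that just reaches the $k$th nearest sample, so its "volume" is twice that distance; the function name and the sample values are hypothetical.

```python
def knn_density(x, samples, k):
    """Eq. 31 in one dimension: (k / n) / V.

    V is the length of the interval centered at x that reaches the k-th
    nearest sample. Raises ZeroDivisionError if that distance is zero,
    which mirrors the divergence of the estimate at a sample point.
    """
    n = len(samples)
    distances = sorted(abs(x - xi) for xi in samples)
    radius = distances[k - 1]
    volume = 2.0 * radius  # interval [x - radius, x + radius]
    return (k / n) / volume


# Hypothetical samples, chosen only for illustration.
samples = [0.0, 1.0, 2.0, 4.0, 7.0]
print(knn_density(1.5, samples, k=2))
print(knn_density(3.0, [1.0], k=1))  # the n = 1 case of Eq. 32
```

The second call reproduces Eq. 32: with a single sample at $x_1 = 1$ the estimate at $x = 3$ is $1/(2 \cdot 2) = 0.25$, and it stays positive however far $x$ moves from the data.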

As with the Parzen-window approach, we could obtain a family of estimates by
taking $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$. However, in the absence of
any additional information, one choice is as good as another, and we can be confident
only that the results will be correct in the infinite data case. For classification, one
popular method is to adjust the window width until the classifier has the lowest error
on a separate set of samples, also drawn from the target distributions, a technique we
shall explore in Chap. ??.


4.4.1 Estimation of a posteriori probabilities


The techniques discussed in the previous sections can be used to estimate the a posteriori
probabilities $P(\omega_i|\mathbf{x})$ from a set of $n$ labelled samples by using the samples
to estimate the densities involved. Suppose that we place a cell of volume $V$ around
$\mathbf{x}$ and capture $k$ samples, $k_i$ of which turn out to be labelled $\omega_i$. Then the obvious
estimate for the joint probability $p(\mathbf{x}, \omega_i)$ is

$$p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V}, \tag{33}$$

and thus a reasonable estimate for $P(\omega_i|\mathbf{x})$ is

$$P_n(\omega_i|\mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}. \tag{34}$$

That is, the estimate of the a posteriori probability that $\omega_i$ is the state of nature is
merely the fraction of the samples within the cell that are labelled $\omega_i$. Consequently,
for minimum error rate we select the category most frequently represented within the
cell. If there are enough samples and if the cell is sufficiently small, it can be shown
that this will yield performance approaching the best possible.
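
Equation 34 amounts to counting labels inside the cell. The sketch below assumes the cell is grown until it holds the $k$ nearest samples (one of the two options for sizing the cell) and works in one dimension; the function name, the data and the labels are hypothetical.

```python
from collections import Counter


def knn_posteriors(x, samples, labels, k):
    """Eq. 34: P_n(w_i | x) = k_i / k among the k samples nearest x."""
    order = sorted(range(len(samples)), key=lambda i: abs(x - samples[i]))
    counts = Counter(labels[i] for i in order[:k])
    return {label: count / k for label, count in counts.items()}


# Hypothetical labelled one-dimensional samples.
samples = [0.2, 0.5, 0.9, 1.4, 1.6, 2.1]
labels = ["w1", "w1", "w2", "w1", "w2", "w2"]

posteriors = knn_posteriors(1.0, samples, labels, k=3)
decision = max(posteriors, key=posteriors.get)  # most frequent label in the cell
print(posteriors, decision)
```

For the query point 1.0 the three nearest samples carry the labels w2, w1 and w1, so the estimated posteriors are 2/3 and 1/3 and the decision is w1.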

When it comes to choosing the size of the cell, it is clear that we can use either
the Parzen-window approach or the $k_n$-nearest-neighbor approach. In the first case,
$V_n$ would be some specified function of $n$, such as $V_n = 1/\sqrt{n}$. In the second case,
$V_n$ would be expanded until some specified number of samples were captured, such
as $k = \sqrt{n}$. In either case, as $n$ goes to infinity an infinite number of samples will fall
within the infinitely small cell. The fact that the cell volume could become arbitrarily
small and yet contain an arbitrarily large number of samples would allow us to learn
the unknown probabilities with virtual certainty and thus eventually obtain optimum
performance. Interestingly enough, we shall now see that we can obtain comparable
performance if we base our decision solely on the label of the single nearest neighbor
of $\mathbf{x}$.


4.5 The Nearest-Neighbor Rule 


While the $k$-nearest-neighbor algorithm was first proposed for arbitrary $k$, the crucial
matter of determining the error bound was first solved for $k = 1$. This nearest-neighbor
algorithm has conceptual and computational simplicity. We begin by letting
$\mathcal{D}^n = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ denote a set of $n$ labelled prototypes, and $\mathbf{x}' \in \mathcal{D}^n$ be the prototype
nearest to a test point $\mathbf{x}$. Then the nearest-neighbor rule for classifying $\mathbf{x}$ is to assign
it the label associated with $\mathbf{x}'$. The nearest-neighbor rule is a sub-optimal procedure;
its use will usually lead to an error rate greater than the minimum possible, the Bayes
rate. We shall see, however, that with an unlimited number of prototypes the error
rate is never worse than twice the Bayes rate.
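
The rule itself is short enough to state in code. The following sketch assumes Euclidean distance (the text has not yet fixed a metric at this point) and breaks ties by taking the first nearest prototype; the function name, prototypes and labels are hypothetical.

```python
def nearest_neighbor_label(x, prototypes, labels):
    """Assign x the label of the prototype nearest to it (Euclidean distance)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = min(range(len(prototypes)),
                  key=lambda i: squared_distance(x, prototypes[i]))
    return labels[nearest]


# Hypothetical labelled prototypes in two dimensions.
prototypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
labels = ["w1", "w1", "w2", "w2"]

print(nearest_neighbor_label((0.9, 0.2), prototypes, labels))
print(nearest_neighbor_label((0.1, 0.8), prototypes, labels))
```

Comparing squared distances avoids a square root without changing which prototype is nearest.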

Before we get immersed in details, let us try to gain a heuristic understanding of
why the nearest-neighbor rule should work so well. To begin with, note that the label
$\theta'$ associated with the nearest neighbor is a random variable, and the probability
that $\theta' = \omega_i$ is merely the a posteriori probability $P(\omega_i|\mathbf{x}')$. When the number of
samples is very large, it is reasonable to assume that $\mathbf{x}'$ is sufficiently close to $\mathbf{x}$ that
$P(\omega_i|\mathbf{x}') \simeq P(\omega_i|\mathbf{x})$. Since this is exactly the probability that nature will be in state
$\omega_i$, the nearest-neighbor rule is effectively matching probabilities with nature.

If we define $\omega_m(\mathbf{x})$ by

$$P(\omega_m|\mathbf{x}) = \max_i P(\omega_i|\mathbf{x}), \tag{35}$$

then the Bayes decision rule always selects $\omega_m$. The nearest-neighbor rule allows us to
partition the feature space into cells consisting of all points closer to a given training point $\mathbf{x}'$ than
to any other training points. All points in such a cell are thus labelled by the category
of the training point, a so-called Voronoi tessellation of the space (Fig. 4.13).


Figure 4.13: In two dimensions, the nearest-neighbor algorithm leads to a partitioning 
of the input space into Voronoi cells, each labelled by the category of the training point 
it contains. In three dimensions, the cells are three-dimensional, and the decision 
boundary resembles the surface of a crystal. 


When $P(\omega_m|\mathbf{x})$ is close to unity, the nearest-neighbor selection is almost always
the same as the Bayes selection. That is, when the minimum probability of error
is small, the nearest-neighbor probability of error is also small. When $P(\omega_m|\mathbf{x})$ is
close to $1/c$, so that all classes are essentially equally likely, the selections made by
the nearest-neighbor rule and the Bayes decision rule are rarely the same, but the
probability of error is approximately $1 - 1/c$ for both. While more careful analysis
is clearly necessary, these observations should make the good performance of the
nearest-neighbor rule less surprising.

Our analysis of the behavior of the nearest-neighbor rule will be directed at obtaining
the infinite-sample conditional average probability of error $P(e|\mathbf{x})$, where the
averaging is with respect to the training samples. The unconditional average probability
of error will then be found by averaging $P(e|\mathbf{x})$ over all $\mathbf{x}$:

$$P(e) = \int P(e|\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}. \tag{36}$$

In passing we should recall that the Bayes decision rule minimizes $P(e)$ by minimizing
$P(e|\mathbf{x})$ for every $\mathbf{x}$. Recall from Chap. ?? that if we let $P^*(e|\mathbf{x})$ be the minimum
possible value of $P(e|\mathbf{x})$, and $P^*$ be the minimum possible value of $P(e)$, then

$$P^*(e|\mathbf{x}) = 1 - P(\omega_m|\mathbf{x}) \tag{37}$$

and

$$P^* = \int P^*(e|\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}. \tag{38}$$


4.5.1 Convergence of the Nearest Neighbor 


We now wish to evaluate the average probability of error for the nearest-neighbor
rule. In particular, if $P_n(e)$ is the $n$-sample error rate, and if

$$P = \lim_{n \to \infty} P_n(e), \tag{39}$$

then we want to show that

$$P^* \le P \le P^* \left(2 - \frac{c}{c-1} P^*\right). \tag{40}$$


We begin by observing that when the nearest-neighbor rule is used with a particular
set of $n$ samples, the resulting error rate will depend on the accidental characteristics
of the samples. In particular, if different sets of $n$ samples are used to
classify $\mathbf{x}$, different vectors $\mathbf{x}'$ will be obtained for the nearest-neighbor of $\mathbf{x}$. Since
the decision rule depends on this nearest-neighbor, we have a conditional probability
of error $P(e|\mathbf{x}, \mathbf{x}')$ that depends on both $\mathbf{x}$ and $\mathbf{x}'$. By averaging over $\mathbf{x}'$, we obtain

$$P(e|\mathbf{x}) = \int P(e|\mathbf{x}, \mathbf{x}') \, p(\mathbf{x}'|\mathbf{x}) \, d\mathbf{x}', \tag{41}$$

where we understand that there is an implicit dependence upon the number $n$ of
training points.

It is usually very difficult to obtain an exact expression for the conditional density
$p(\mathbf{x}'|\mathbf{x})$. However, since $\mathbf{x}'$ is by definition the nearest-neighbor of $\mathbf{x}$, we expect this
density to be very peaked in the immediate vicinity of $\mathbf{x}$, and very small elsewhere.
Furthermore, as $n$ goes to infinity we expect $p(\mathbf{x}'|\mathbf{x})$ to approach a delta function
centered at $\mathbf{x}$, making the evaluation of Eq. 41 trivial. To show that this is indeed the
case, we must assume that at the given $\mathbf{x}$, $p(\cdot)$ is continuous and not equal to zero.
Under these conditions, the probability that any sample falls within a hypersphere $\mathcal{S}$
centered about $\mathbf{x}$ is some positive number $P_s$:

$$P_s = \int_{\mathbf{x}' \in \mathcal{S}} p(\mathbf{x}') \, d\mathbf{x}'. \tag{42}$$

Thus, the probability that all $n$ of the independently drawn samples fall outside
this hypersphere is $(1 - P_s)^n$, which approaches zero as $n$ goes to infinity. Thus $\mathbf{x}'$
converges to $\mathbf{x}$ in probability, and $p(\mathbf{x}'|\mathbf{x})$ approaches a delta function, as expected. In
fact, by using measure theoretic methods one can make even stronger (as well as more
rigorous) statements about the convergence of $\mathbf{x}'$ to $\mathbf{x}$, but this result is sufficient for
our purposes.
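
The speed of this convergence is easy to tabulate. In the sketch below the value $P_s = 0.01$ is an arbitrary assumption standing in for the probability mass of some small hypersphere around $\mathbf{x}$, and the function name is hypothetical.

```python
def prob_no_sample_in_sphere(p_s, n):
    """Probability that all n i.i.d. samples fall outside S: (1 - P_s)^n."""
    return (1.0 - p_s) ** n


# Even a sphere holding only 1% of the probability mass is almost surely
# occupied once n reaches the thousands.
for n in (10, 100, 1000, 10000):
    print(n, prob_no_sample_in_sphere(0.01, n))
```

The decay is geometric in $n$, so for any fixed hypersphere of positive probability the nearest neighbor eventually lies inside it with probability approaching one.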


4.5.2 Error Rate for the Nearest-Neighbor Rule 


We now turn to the calculation of the conditional probability of error $P_n(e|\mathbf{x}, \mathbf{x}')$.
To avoid a potential source of confusion, we must state the problem with somewhat
greater care than has been exercised so far. When we say that we have $n$ independently
drawn labelled samples, we are talking about $n$ pairs of random variables
$(\mathbf{x}_1, \theta_1), (\mathbf{x}_2, \theta_2), \ldots, (\mathbf{x}_n, \theta_n)$, where $\theta_j$ may be any of the $c$ states of nature $\omega_1, \ldots, \omega_c$.
We assume that these pairs were generated by selecting a state of nature $\omega_j$ for $\theta_j$ with
probability $P(\omega_j)$ and then selecting an $\mathbf{x}_j$ according to the probability law $p(\mathbf{x}|\omega_j)$,
with each pair being selected independently. Suppose that during classification nature
selects a pair $(\mathbf{x}, \theta)$, and that $\mathbf{x}'_n$, labelled $\theta'_n$, is the training sample nearest $\mathbf{x}$. Since
the state of nature when $\mathbf{x}'_n$ was drawn is independent of the state of nature when $\mathbf{x}$
is drawn, we have

$$P(\theta, \theta'_n|\mathbf{x}, \mathbf{x}'_n) = P(\theta|\mathbf{x}) \, P(\theta'_n|\mathbf{x}'_n). \tag{43}$$

Now if we use the nearest-neighbor decision rule, we commit an error whenever $\theta \ne \theta'_n$.
Thus, the conditional probability of error $P_n(e|\mathbf{x}, \mathbf{x}'_n)$ is given by

$$\begin{aligned}
P_n(e|\mathbf{x}, \mathbf{x}'_n) &= 1 - \sum_{i=1}^{c} P(\theta = \omega_i, \theta'_n = \omega_i|\mathbf{x}, \mathbf{x}'_n) \\
&= 1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x}) \, P(\omega_i|\mathbf{x}'_n).
\end{aligned} \tag{44}$$


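To obtain $P_n(e)$ we must substitute this expression into Eq. 41 for $P_n(e|\mathbf{x})$ and
then average the result over $\mathbf{x}$. This is very difficult, in general, but as we remarked
earlier the integration called for in Eq. 41 becomes trivial as $n$ goes to infinity and
$p(\mathbf{x}'|\mathbf{x})$ approaches a delta function. If $P(\omega_i|\mathbf{x})$ is continuous at $\mathbf{x}$, we thus obtain

$$\begin{aligned}
\lim_{n \to \infty} P_n(e|\mathbf{x}) &= \int \left[1 - \sum_{i=1}^{c} P(\omega_i|\mathbf{x}) \, P(\omega_i|\mathbf{x}'_n)\right] \delta(\mathbf{x}'_n - \mathbf{x}) \, d\mathbf{x}'_n \\
&= 1 - \sum_{i=1}^{c} P^2(\omega_i|\mathbf{x}).
\end{aligned} \tag{45}$$

Therefore, provided we can exchange some limits and integrals, the asymptotic
nearest-neighbor error rate is given by

$$\begin{aligned}
P &= \lim_{n \to \infty} P_n(e) \\
&= \lim_{n \to \infty} \int P_n(e|\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x} \\
&= \int \left[1 - \sum_{i=1}^{c} P^2(\omega_i|\mathbf{x})\right] p(\mathbf{x}) \, d\mathbf{x}.
\end{aligned} \tag{46}$$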


4.5.3 Error Bounds 


While Eq. 46 presents an exact result, it is more illuminating to obtain bounds on $P$ in
terms of the Bayes rate $P^*$. An obvious lower bound on $P$ is $P^*$ itself. Furthermore,
it can be shown that for any $P^*$ there is a set of conditional and prior probabilities
for which the bound is achieved, so in this sense it is a tight lower bound.

The problem of establishing a tight upper bound is more interesting. The basis
for hoping for a low upper bound comes from observing that if the Bayes rate is
low, $P(\omega_i|\mathbf{x})$ is near 1.0 for some $i$, say $i = m$. Thus the integrand in Eq. 46 is
approximately $1 - P^2(\omega_m|\mathbf{x}) \simeq 2(1 - P(\omega_m|\mathbf{x}))$, and since

$$P^*(e|\mathbf{x}) = 1 - P(\omega_m|\mathbf{x}), \tag{47}$$


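integration over $\mathbf{x}$ might yield about twice the Bayes rate, which is still low and
acceptable for some applications. To obtain an exact upper bound, we must find out
how large the nearest-neighbor error rate $P$ can become for a given Bayes rate $P^*$.
Thus, Eq. 46 leads us to ask how small $\sum_{i=1}^{c} P^2(\omega_i|\mathbf{x})$ can be for a given $P(\omega_m|\mathbf{x})$. First
we write

$$\sum_{i=1}^{c} P^2(\omega_i|\mathbf{x}) = P^2(\omega_m|\mathbf{x}) + \sum_{i \neq m} P^2(\omega_i|\mathbf{x}), \tag{48}$$

and then seek to bound this sum by minimizing the second term subject to the
following constraints:

• $P(\omega_i|\mathbf{x}) \geq 0$

• $\sum_{i \neq m} P(\omega_i|\mathbf{x}) = 1 - P(\omega_m|\mathbf{x}) = P^*(e|\mathbf{x})$.

With a little thought we see that $\sum_{i=1}^{c} P^2(\omega_i|\mathbf{x})$ is minimized if all of the a posteriori
probabilities except the $m$th are equal. The second constraint yields

$$P(\omega_i|\mathbf{x}) = \begin{cases} \dfrac{P^*(e|\mathbf{x})}{c-1} & i \neq m \\[1ex] 1 - P^*(e|\mathbf{x}) & i = m. \end{cases} \tag{49}$$

Thus we have the inequalities

$$\sum_{i=1}^{c} P^2(\omega_i|\mathbf{x}) \geq \big(1 - P^*(e|\mathbf{x})\big)^2 + \frac{P^{*2}(e|\mathbf{x})}{c-1} \tag{50}$$

and

$$1 - \sum_{i=1}^{c} P^2(\omega_i|\mathbf{x}) \leq 2P^*(e|\mathbf{x}) - \frac{c}{c-1}\, P^{*2}(e|\mathbf{x}). \tag{51}$$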

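This immediately shows that $P \leq 2P^*$, since we can substitute this result in
Eq. 46 and merely drop the second term. However, a tighter bound can be obtained
by observing that the variance is

$$\begin{aligned}
\mathrm{Var}[P^*(e|\mathbf{x})] &= \int \big[P^*(e|\mathbf{x}) - P^*\big]^2\, p(\mathbf{x})\, d\mathbf{x} \\
&= \int P^{*2}(e|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} - P^{*2} \geq 0,
\end{aligned}$$

so that

$$\int P^{*2}(e|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \geq P^{*2}, \tag{52}$$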


with equality holding if and only if the variance of $P^*(e|\mathbf{x})$ is zero. Using this result
and substituting Eq. 51 into Eq. 46, we obtain the desired bounds on the nearest-neighbor
error $P$ in the case of an infinite number of samples:

$$P^* \leq P \leq P^*\Big(2 - \frac{c}{c-1}\, P^*\Big). \tag{53}$$


It is easy to show that this upper bound is achieved in the so-called zero-information
case in which the densities $p(\mathbf{x}|\omega_i)$ are identical, so that $P(\omega_i|\mathbf{x}) = P(\omega_i)$ and furthermore
$P^*(e|\mathbf{x})$ is independent of $\mathbf{x}$ (Problem 17). Thus the bounds given by Eq. 53 are
as tight as possible, in the sense that for any $P^*$ there exist conditional and a priori
probabilities for which the bounds are achieved. In particular, the Bayes rate $P^*$ can
be anywhere between 0 and $(c-1)/c$ and the bounds meet at the two extreme values
for the probabilities. When the Bayes rate is small, the upper bound is approximately
twice the Bayes rate (Fig. 4.14).
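
The bounds of Eq. 53 can be checked numerically on a toy problem. The sketch below is only an illustration under assumed conditions: a one-dimensional two-category problem with Gaussian class-conditional densities, where the helper names (`gaussian_pdf`, `asymptotic_error_rates`, `nn_upper_bound`), the integration range and the midpoint rule are all choices made here rather than anything prescribed by the text. It integrates the Bayes error and the asymptotic nearest-neighbor error of Eq. 46 and compares them with the upper bound.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def asymptotic_error_rates(priors, means, stds, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule estimates of the Bayes rate P* and the asymptotic NN rate P (Eq. 46)."""
    dx = (hi - lo) / steps
    bayes_rate = 0.0
    nn_rate = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * dx
        joint = [p * gaussian_pdf(x, m, sd) for p, m, sd in zip(priors, means, stds)]
        px = sum(joint)
        if px == 0.0:  # density underflowed; this slice contributes nothing
            continue
        posteriors = [j / px for j in joint]
        bayes_rate += (1.0 - max(posteriors)) * px * dx
        nn_rate += (1.0 - sum(q * q for q in posteriors)) * px * dx
    return bayes_rate, nn_rate


def nn_upper_bound(bayes_rate, c):
    """Upper bound of Eq. 53 for a c-category problem."""
    return bayes_rate * (2.0 - c / (c - 1.0) * bayes_rate)


# Assumed toy problem: equal priors, unit-variance Gaussians centered at -1 and +1.
p_star, p_nn = asymptotic_error_rates([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
upper = nn_upper_bound(p_star, c=2)
print(f"P* = {p_star:.4f}, P = {p_nn:.4f}, upper bound = {upper:.4f}")
```

Setting both means equal reproduces the zero-information case, in which the computed nearest-neighbor rate should coincide with the upper bound.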


Figure 4.14: Bounds on the nearest-neighbor error rate P in a c-category problem 
given infinite training data, where P* is the Bayes error (Eq. 53). At low error rates, 
the nearest-neighbor error rate is bounded above by twice the Bayes rate. 


Since P is always less than or equal to 2P*, if one had an infinite collection of data 
and used an arbitrarily complicated decision rule, one could at most cut the error rate 
in half. In this sense, at least half of the classification information in an infinite data 
set resides in the nearest neighbor. 

It is natural to ask how well the nearest-neighbor rule works in the finite-sample 
case, and how rapidly the performance converges to the asymptotic value. Unfor- 
tunately, despite prolonged effort on such problems, the only statements that can 
be made in the general case are negative. It can be shown that convergence can 
be arbitrarily slow, and the error rate P,(e) need not even decrease monotonically 
with n. As with other nonparametric methods, it is difficult to obtain anything other 
than asymptotic results without making further assumptions about the underlying 
probability structure (Problems 13 & 14). 


4.5.4 The k-Nearest-Neighbor Rule 


An obvious extension of the nearest-neighbor rule is the k-nearest-neighbor rule. As
one would expect from the name, this rule classifies x by assigning it the label most
frequently represented among the k nearest samples; in other words, a decision is made
by examining the labels on the k nearest neighbors and taking a vote (Fig. 4.15). We
shall not go into a thorough analysis of the k-nearest-neighbor rule. However, by
considering the two-class case with k odd (to avoid ties), we can gain some additional
insight into these procedures.



Figure 4.15: The k-nearest-neighbor query starts at the test point and grows a spher- 
ical region until it encloses k training samples, and labels the test point by a majority 
vote of these samples. In this k = 5 case, the test point x would be labelled the 
category of the black points. 


The basic motivation for considering the k-nearest-neighbor rule rests on our earlier
observation about matching probabilities with nature. We notice first that if
$k$ is fixed and the number $n$ of samples is allowed to approach infinity, then all of
the $k$ nearest neighbors will converge to $\mathbf{x}$. Hence, as in the single-nearest-neighbor
case, the labels on each of the $k$ nearest neighbors are random variables, which
independently assume the values $\omega_i$ with probabilities $P(\omega_i|\mathbf{x})$, $i = 1, 2$. If $P(\omega_m|\mathbf{x})$
is the larger a posteriori probability, then the Bayes decision rule always selects $\omega_m$.
The single-nearest-neighbor rule selects $\omega_m$ with probability $P(\omega_m|\mathbf{x})$. The k-nearest-neighbor
rule selects $\omega_m$ if a majority of the $k$ nearest neighbors are labeled $\omega_m$, an
event of probability

$$\sum_{i=(k+1)/2}^{k} \binom{k}{i}\, P(\omega_m|\mathbf{x})^{i}\, \big[1 - P(\omega_m|\mathbf{x})\big]^{k-i}. \tag{54}$$

In general, the larger the value of $k$, the greater the probability that $\omega_m$ will be
selected.
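
The probability in Eq. 54 is a binomial tail and is easy to tabulate. The following is a minimal sketch (the function name `majority_probability` and the example posterior of 0.7 are assumptions for illustration) showing how the chance of selecting the more probable category grows with odd k when its posterior exceeds 1/2.

```python
from math import comb


def majority_probability(p_m, k):
    """Probability that more than half of k independent labels equal omega_m (Eq. 54)."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    return sum(comb(k, i) * p_m**i * (1.0 - p_m) ** (k - i)
               for i in range((k + 1) // 2, k + 1))


# Assumed posterior P(omega_m | x) = 0.7; k = 1 recovers the single-nearest-neighbor case.
for k in (1, 3, 5, 11, 51):
    print(k, round(majority_probability(0.7, k), 4))
```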

We could analyze the k-nearest-neighbor rule in much the same way that we
analyzed the single-nearest-neighbor rule. However, since the arguments become more
involved and supply little additional insight, we shall content ourselves with stating
the results. It can be shown that if $k$ is odd, the large-sample two-class error rate for
the k-nearest-neighbor rule is bounded above by the function $C_k(P^*)$, where $C_k(P^*)$
is defined to be the smallest concave function of $P^*$ greater than

$$\sum_{i=0}^{(k-1)/2} \binom{k}{i} \Big[ (P^*)^{i+1} (1 - P^*)^{k-i} + (P^*)^{k-i} (1 - P^*)^{i+1} \Big]. \tag{55}$$

Here the summation over the first bracketed term represents the probability of error
due to $i$ points coming from the category having the minimum probability and $k - i > i$
points from the other category. The summation over the second term in the brackets
is the probability that $k - i$ points are from the minimum-probability category and
$i + 1 < k - i$ from the higher probability category. Both of these cases constitute
errors under the k-nearest-neighbor decision rule, and thus we must add them to find
the full probability of error (Problem 18).

Figure 4.16 shows the bounds on the k-nearest-neighbor error rates for several 
values of k. As k increases, the upper bounds get progressively closer to the lower 
bound — the Bayes rate. In the limit as k goes to infinity, the two bounds meet and 
the k-nearest-neighbor rule becomes optimal. 
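
To see how the expression in Eq. 55 tightens with k, it can be evaluated directly. The sketch below computes only the sum in Eq. 55, not its concave envelope $C_k(P^*)$; the function name `eq55` is an assumption made for this illustration. For k = 1 the sum reduces to $2P^*(1 - P^*)$, the two-category form of the upper bound in Eq. 53.

```python
from math import comb


def eq55(p_star, k):
    """Evaluate the sum in Eq. 55 for odd k and a Bayes rate 0 <= p_star <= 0.5."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    q = 1.0 - p_star
    return sum(comb(k, i) * (p_star ** (i + 1) * q ** (k - i)
                             + p_star ** (k - i) * q ** (i + 1))
               for i in range((k - 1) // 2 + 1))


# Assumed Bayes rate of 0.1: the value approaches 0.1 from above as k grows.
for k in (1, 3, 5, 9, 21):
    print(k, round(eq55(0.1, k), 4))
```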


Figure 4.16: The error rate for the k-nearest-neighbor rule for a two-category problem
is bounded by $C_k(P^*)$ in Eq. 55. Each curve is labelled by $k$; when $k = \infty$, the
estimated probabilities match the true probabilities and thus the error rate is equal
to the Bayes rate, i.e., $P = P^*$.


At the risk of sounding repetitive, we conclude by commenting once again on the
finite-sample situation encountered in practice. The k-nearest-neighbor rule can be
viewed as another attempt to estimate the a posteriori probabilities $P(\omega_i|\mathbf{x})$ from
samples. We want to use a large value of $k$ to obtain a reliable estimate. On the
other hand, we want all of the $k$ nearest neighbors $\mathbf{x}'$ to be very near $\mathbf{x}$ to be sure
that $P(\omega_i|\mathbf{x}')$ is approximately the same as $P(\omega_i|\mathbf{x})$. This forces us to choose a
compromise $k$ that is a small fraction of the number of samples. It is only in the limit
as $n$ goes to infinity that we can be assured of the nearly optimal behavior of the
k-nearest-neighbor rule.
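
The rule itself takes only a few lines of code. The following is a minimal sketch using exhaustive search and the Euclidean distance; the function name `knn_classify` and the tiny two-feature data set are invented for illustration. It also shows how the decision for the same test point can change with k.

```python
import math
from collections import Counter


def knn_classify(x, samples, labels, k):
    """Label x by majority vote among its k nearest training samples."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # Sort sample indices by Euclidean distance to x and keep the k closest.
    order = sorted(range(len(samples)), key=lambda j: math.dist(x, samples[j]))
    votes = Counter(labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: three salmon, three sea bass, one of them an outlier.
samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (0.45, 0.45)]
labels = ["salmon", "salmon", "salmon", "sea bass", "sea bass", "sea bass"]
print(knn_classify((0.4, 0.4), samples, labels, k=1))
print(knn_classify((0.4, 0.4), samples, labels, k=5))
```

With k = 1 the single closest sample decides; with k = 5 the vote is taken over a wider neighborhood, which is the tradeoff between reliability and locality described above.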


4.5.5 Computational Complexity of the k-Nearest-Neighbor Rule


The computational complexity of the nearest-neighbor algorithm, both in space
(storage of prototypes) and time (search), has received a great deal of analysis.
There are a number of elegant theorems from computational geometry on the
construction of Voronoi tessellations and nearest-neighbor searches in one- and
two-dimensional spaces. However, because the greatest use of nearest-neighbor techniques
is for problems with many features, we concentrate on the more general d-dimensional
case.

Suppose we have $n$ labelled training samples in $d$ dimensions, and seek to find
the closest to a test point $\mathbf{x}$ ($k = 1$). In the most naive approach we inspect each
stored point in turn, calculate its Euclidean distance to $\mathbf{x}$, retaining the identity only
of the current closest one. Each distance calculation is $O(d)$, and since all $n$ stored
points are inspected, this search is $O(dn)$ for each test point. An alternative but
straightforward parallel implementation is shown in Fig. 4.17, which is $O(1)$ in time
and $O(n)$ in space.


Figure 4.17: A parallel nearest-neighbor circuit can perform search in constant, i.e.,
$O(1)$, time. The d-dimensional test pattern $\mathbf{x}$ is presented to each box, which
calculates which side of a cell’s face $\mathbf{x}$ lies on. If it is on the “close” side of every face
of a cell, it lies in the Voronoi cell of the stored pattern, and receives its label.

There are three general algorithmic techniques for reducing the computational 
burden in nearest-neighbor searches: computing partial distances, prestructuring, and 
editing the stored prototypes. In partial distance, we calculate the distance using some 
subset r of the full d dimensions, and if this partial distance is too great we do not 
compute further. The partial distance based on r selected dimensions is 


$$D_r(\mathbf{a}, \mathbf{b}) = \Big( \sum_{k=1}^{r} (a_k - b_k)^2 \Big)^{1/2}, \tag{56}$$


where r < d. Intuitively speaking, partial distance methods assume that what we 
know about the distance in a subspace is indicative of the full space. Of course, the 
partial distance is strictly non-decreasing as we add the contributions from more and 
more dimensions. Consequently, we can confidently terminate a distance calculation 
to any prototype once its partial distance is greater than the full r = d Euclidean 
distance to the current closest prototype. 
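
The early-termination idea can be sketched as follows. This is an illustrative implementation, not a tuned one: the function name `nearest_with_partial_distance` is an assumption, the dimensions are simply taken in their stored order, and squared distances are compared to avoid taking square roots inside the loop.

```python
def nearest_with_partial_distance(x, prototypes):
    """Return (index, distance) of the prototype nearest x, abandoning a
    candidate as soon as its partial squared distance reaches the best so far."""
    best_index = -1
    best_sq = float("inf")
    for j, proto in enumerate(prototypes):
        partial_sq = 0.0
        for a_k, b_k in zip(x, proto):
            partial_sq += (a_k - b_k) ** 2
            if partial_sq >= best_sq:  # cannot beat the current closest prototype
                break
        else:
            # All d dimensions were accumulated without exceeding the best distance.
            best_index, best_sq = j, partial_sq
    return best_index, best_sq ** 0.5


# Hypothetical prototypes in three dimensions.
prototypes = [(5.0, 5.0, 5.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 0.5)]
print(nearest_with_partial_distance((0.0, 0.0, 0.0), prototypes))
```

Because the partial sum never decreases, the result is identical to that of an exhaustive search; only the amount of arithmetic changes.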

In prestructuring we create some form of search tree in which prototypes are selectively
linked. During classification, we compute the distance of the test point to one
or a few stored “entry” or “root” prototypes and then consider only the prototypes 
linked to it. Of these, we find the one that is closest to the test point, and recursively 
consider only subsequent linked prototypes. If the tree is properly structured, we will 
reduce the total number of prototypes that need to be searched. 

Consider a trivial illustration of prestructuring in which we store a large number
of prototypes that happen to be distributed uniformly in the unit square, i.e.,
$p(\mathbf{x}) \sim U\big((0, 0)^t, (1, 1)^t\big)$. Imagine we prestructure this set using four entry or root
prototypes, at $(1/4, 1/4)^t$, $(1/4, 3/4)^t$, $(3/4, 1/4)^t$ and $(3/4, 3/4)^t$, each fully linked
only to points in its corresponding quadrant. When a test pattern $\mathbf{x}$ is presented, the
closest of these four prototypes is determined, and then the search is limited to the
prototypes in the corresponding quadrant. In this way, 3/4 of the prototypes need
never be queried.

Note that in this method we are no longer guaranteed to find the closest prototype.
For instance, suppose the test point is near a boundary of the quadrants, e.g.,
$\mathbf{x} = (0.499, 0.499)^t$. In this particular case only prototypes in the first quadrant will be searched.
Note however that the closest prototype might actually be in one of the other three
quadrants, somewhere near $(0.5, 0.5)^t$. This illustrates a very general property in pattern
recognition: the tradeoff of search complexity against accuracy.
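
The quadrant illustration can be made concrete. In the sketch below the four root prototypes are assumed to sit at the quadrant centers, each stored prototype is linked to its closest root, and the names `ROOTS`, `build_quadrant_links` and `prestructured_search` as well as the four stored points are hypothetical. The example shows a test point near the center of the square for which the restricted search misses the true nearest neighbor.

```python
import math

# Assumed entry ("root") prototypes: the centers of the four quadrants of the unit square.
ROOTS = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]


def nearest(x, points):
    """Exhaustive nearest-neighbor search by Euclidean distance."""
    return min(points, key=lambda p: math.dist(x, p))


def build_quadrant_links(prototypes):
    """Link every prototype to the root closest to it."""
    links = {root: [] for root in ROOTS}
    for p in prototypes:
        links[nearest(p, ROOTS)].append(p)
    return links


def prestructured_search(x, links):
    """Search only the prototypes linked to the root nearest x."""
    candidates = links[nearest(x, ROOTS)]
    return nearest(x, candidates) if candidates else None


prototypes = [(0.10, 0.10), (0.30, 0.20), (0.51, 0.51), (0.80, 0.90)]
links = build_quadrant_links(prototypes)
x = (0.499, 0.499)
print("prestructured:", prestructured_search(x, links))
print("exhaustive:   ", nearest(x, prototypes))
```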


More sophisticated search trees will have each stored prototype linked to a small 
number of others, and a full analysis of these methods would take us far afield. Nev- 
ertheless, here too, so long as we do not query all training prototypes, we are not 
guaranteed that the nearest prototype will be found. 

The third method for reducing the complexity of nearest-neighbor search is to 
eliminate “useless” prototypes during training, a technique known variously as editing, 
pruning or condensing. A simple method to reduce the O(n) space complexity is to 
eliminate prototypes that are surrounded by training points of the same category 
label. This leaves the decision boundaries — and hence the error — unchanged, while 
reducing recall times. A simple editing algorithm is as follows. 


Algorithm 3 (Nearest-neighbor editing)


begin initialize $j = 0$, $\mathcal{D}$ = data set, $n$ = #prototypes
    construct the full Voronoi diagram of $\mathcal{D}$
    do $j \leftarrow j + 1$; for each prototype $\mathbf{x}'_j$
        find the Voronoi neighbors of $\mathbf{x}'_j$
        if any neighbor is not from the same class as $\mathbf{x}'_j$ then mark $\mathbf{x}'_j$
    until $j = n$
    discard all points that are not marked
    construct the Voronoi diagram of the remaining (marked) prototypes
end


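The complexity of this editing algorithm is $O(d^3 n^{\lfloor d/2 \rfloor} \ln n)$, where the “floor”
operation $\lfloor \cdot \rfloor$ rounds down to the nearest integer, so that $\lfloor d/2 \rfloor = k$ both if $d = 2k$ is even and if $d = 2k + 1$ is odd (Problem 10).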
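
Constructing a Voronoi diagram in d dimensions is beyond a short example, but in one dimension the Voronoi neighbors of a point are simply the points adjacent to it in sorted order, so the editing step of Algorithm 3 can be sketched directly. The code below is restricted to that one-dimensional special case and assumes distinct feature values; the function names and data are hypothetical. Note that if every prototype carries the same label, nothing is marked and the edited set is empty, a degenerate case a practical implementation would need to handle.

```python
def edit_prototypes_1d(points, labels):
    """Keep only prototypes with at least one adjacent (Voronoi) neighbor
    of a different class. Assumes distinct scalar feature values."""
    order = sorted(range(len(points)), key=lambda j: points[j])
    kept = []
    for pos, j in enumerate(order):
        neighbors = []
        if pos > 0:
            neighbors.append(order[pos - 1])
        if pos + 1 < len(order):
            neighbors.append(order[pos + 1])
        if any(labels[nb] != labels[j] for nb in neighbors):
            kept.append(j)  # "marked": this prototype contributes to a boundary
    return [points[j] for j in kept], [labels[j] for j in kept]


def nn_label(x, points, labels):
    """Single-nearest-neighbor label for a scalar test value x."""
    j = min(range(len(points)), key=lambda i: abs(x - points[i]))
    return labels[j]


points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["A", "A", "A", "B", "B", "A", "A"]
kept_points, kept_labels = edit_prototypes_1d(points, labels)
print(kept_points, kept_labels)
```

Classifying a grid of test values with the full and the edited sets gives the same labels, since only prototypes interior to a run of one class are discarded.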

According to Algorithm 3, if a prototype contributes to a decision boundary (i.e., 
at least one of its neighbors is from a different category), then it remains in the set; 
otherwise it is edited away (Problem 15). This algorithm does not guarantee that the
minimal set of points is found (Problem 16); nevertheless, it is one of the examples in
pattern recognition in which the computational complexity can be reduced, sometimes
significantly, without affecting the accuracy. One drawback of such pruned
nearest-neighbor systems is that one generally cannot add training data later, since
the pruning step requires knowledge of all the training data ahead of time (Computer
exercise ??). We conclude this section by noting the obvious, i.e., that we can combine
these three complexity reduction methods. We might first edit the prototypes,
then form a search tree during training, and finally compute partial distances during
classification.


4.6 Metrics and Nearest-Neighbor Classification


The nearest-neighbor classifier relies on a metric or “distance” function between patterns.
While so far we have assumed the Euclidean metric in $d$ dimensions, the notion
of a metric is far more general, and we now turn to the use of alternative measures of
distance to address key problems in classification. First let us review the properties of
a metric. A metric $D(\cdot, \cdot)$ is merely a function that gives a generalized scalar distance
between two argument patterns. A metric must have four properties: for all vectors
$\mathbf{a}$, $\mathbf{b}$ and $\mathbf{c}$


non-negativity: $D(\mathbf{a}, \mathbf{b}) \geq 0$

reflexivity: $D(\mathbf{a}, \mathbf{b}) = 0$ if and only if $\mathbf{a} = \mathbf{b}$

symmetry: $D(\mathbf{a}, \mathbf{b}) = D(\mathbf{b}, \mathbf{a})$

triangle inequality: $D(\mathbf{a}, \mathbf{b}) + D(\mathbf{b}, \mathbf{c}) \geq D(\mathbf{a}, \mathbf{c})$.


It is easy to verify that the Euclidean formula for distance in $d$ dimensions,

$$D(\mathbf{a}, \mathbf{b}) = \Big( \sum_{k=1}^{d} (a_k - b_k)^2 \Big)^{1/2}, \tag{57}$$

obeys the properties of a metric. Moreover, if each coordinate is multiplied by an
arbitrary constant, the resulting space also obeys a metric (Problem 19), though it
can lead to problems in nearest-neighbor classifiers (Fig. 4.18).




Figure 4.18: Even if each coordinate is scaled by some constant, the resulting space 
still obeys the properties of a metric. However, a nearest-neighbor classifier would 
have different results depending upon such rescaling. Consider the test point x and 
its nearest neighbor. In the original space (left), the black prototype is closest. In 
the figure at the right, the $x_1$ axis has been rescaled by a factor 1/3; now the nearest
prototype is the red one. If there is a large disparity in the ranges of the full data in 
each dimension, a common procedure is to rescale all the data to equalize such ranges, 
and this is equivalent to changing the metric in the original space. 
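
The effect of rescaling described in Fig. 4.18 is easy to reproduce. The sketch below uses two invented prototypes (their coordinates are not taken from the figure) and a hypothetical helper `nearest_label`; rescaling the first axis by 1/3 changes which prototype is nearest to the test point.

```python
import math


def nearest_label(x, prototypes, scale=(1.0, 1.0)):
    """1-NN label after multiplying each coordinate by the given scale factor."""
    def scaled(p):
        return tuple(s * v for s, v in zip(scale, p))
    best = min(prototypes, key=lambda item: math.dist(scaled(x), scaled(item[0])))
    return best[1]


# Hypothetical prototypes: (coordinates, label).
prototypes = [((0.0, 1.2), "black"), ((3.0, 0.5), "red")]
x = (0.0, 0.0)
print(nearest_label(x, prototypes))                          # original space
print(nearest_label(x, prototypes, scale=(1.0 / 3.0, 1.0)))  # x1 axis rescaled by 1/3
```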


One general class of metrics for d-dimensional patterns is the Minkowski metric

$$L_k(\mathbf{a}, \mathbf{b}) = \Big( \sum_{i=1}^{d} |a_i - b_i|^k \Big)^{1/k}, \tag{58}$$

also referred to as the $L_k$ norm (Problem 20); thus, the Euclidean distance is the $L_2$
norm. The $L_1$ norm is sometimes called the Manhattan or city block distance, the
shortest path between $\mathbf{a}$ and $\mathbf{b}$, each segment of which is parallel to a coordinate axis.
(The name derives from the fact that the streets of Manhattan run north-south and
east-west.) Suppose we compute the distances between the projections of $\mathbf{a}$ and $\mathbf{b}$
onto each of the $d$ coordinate axes. The $L_\infty$ distance between $\mathbf{a}$ and $\mathbf{b}$ corresponds
to the maximum of these projected distances (Fig. 4.19).
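
Eq. 58 translates directly into code. This is a small sketch; the function name `minkowski` is an assumption, and the limiting $L_\infty$ case is handled separately as the maximum coordinate difference.

```python
import math


def minkowski(a, b, k):
    """L_k distance of Eq. 58; k = math.inf gives the maximum coordinate difference."""
    diffs = [abs(a_i - b_i) for a_i, b_i in zip(a, b)]
    if k == math.inf:
        return max(diffs)
    if k < 1:
        raise ValueError("k must be at least 1 for the triangle inequality to hold")
    return sum(d ** k for d in diffs) ** (1.0 / k)


a, b = (0, 0), (3, 4)
for k in (1, 2, 4, math.inf):
    print(k, minkowski(a, b, k))
```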


Figure 4.19: Each colored surface consists of points a distance 1.0 from the origin,
measured using different values for $k$ in the Minkowski metric ($k$ is printed in red).
Thus the white surfaces correspond to the $L_1$ norm (Manhattan distance), light gray
the $L_2$ norm (Euclidean distance), dark gray the $L_4$ norm, and red the $L_\infty$ norm.


The Tanimoto metric finds most use in taxonomy, where the distance between two
sets is defined as

$$D_{Tanimoto}(\mathcal{S}_1, \mathcal{S}_2) = \frac{n_1 + n_2 - 2 n_{12}}{n_1 + n_2 - n_{12}}, \tag{59}$$

where $n_1$ and $n_2$ are the number of elements in sets $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively, and $n_{12}$ is
the number that is in both sets. The Tanimoto metric finds greatest use for problems
in which two patterns or features (the elements in the set) are either the same or
different, and there is no natural notion of graded similarity (Problem 27).
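
With sets as the pattern representation, Eq. 59 is a one-line computation. A minimal sketch follows; the function name `tanimoto_distance` is an assumption, and returning 0 for two empty sets is a convention chosen here, since Eq. 59 is undefined in that case.

```python
def tanimoto_distance(s1, s2):
    """Tanimoto distance of Eq. 59 between two finite sets."""
    s1, s2 = set(s1), set(s2)
    n1, n2, n12 = len(s1), len(s2), len(s1 & s2)
    denom = n1 + n2 - n12  # number of elements in the union
    if denom == 0:
        return 0.0  # both sets empty: treated as identical (a convention)
    return (n1 + n2 - 2 * n12) / denom


print(tanimoto_distance({"fins", "scales", "gills"}, {"scales", "gills", "whiskers"}))
```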

The selection among these or other metrics is generally dictated by computational 
concerns, and it is hard to base a choice on prior knowledge about the distributions. 
One exception is when there is a great difference in the range of the data along different
axes in multidimensional data. Here, we should scale the data (or, equivalently,
alter the metric) as suggested in Fig. 4.18.


4.6.1 Tangent distance 


There may be drawbacks inherent in the uncritical use of a particular metric in
nearest-neighbor classifiers, and these drawbacks can be overcome by the careful use
of more general measures of distance. One crucial such problem is that of invariance.
Consider a 100-dimensional pattern $\mathbf{x}'$ representing a 10 × 10 pixel grayscale image of
a handwritten 5. Consider too the Euclidean distance from $\mathbf{x}'$ to the pattern representing
an image that is shifted horizontally but otherwise identical (Fig. 4.20). Even
if the relative shift is a mere three pixels, the Euclidean distance grows very large,
much greater than the distance to an unshifted 8. Clearly the Euclidean metric is of
little use in a nearest-neighbor classifier that must be insensitive to such translations.

Likewise, other transformations, such as overall rotation or scale of the image, 
would not be well accommodated by Euclidean distance in this manner. Such draw- 
backs are especially pronounced if we demand that our classifier be simultaneously 
invariant to several transformations, such as horizontal translation, vertical transla- 
tion, overall scale, rotation, line thickness, shear, and so on (Computer exercise 7 & 
8). While we could preprocess the images by shifting their centers to coalign, then 
have the same bounding box, and so forth, such an approach has its own difficulties, 
such as sensitivity to outlying pixels or to noise. We explore here alternatives to such 
preprocessing. 




Figure 4.20: The uncritical use of the Euclidean metric cannot address the problem of
translation invariance. Pattern $\mathbf{x}'$ represents a handwritten 5, and $\mathbf{x}'(s = 3)$ the same
shape but shifted three pixels to the right. The Euclidean distance $D(\mathbf{x}', \mathbf{x}'(s = 3))$ is
much larger than $D(\mathbf{x}', \mathbf{x}_8)$, where $\mathbf{x}_8$ represents the handwritten 8. Nearest-neighbor
classification based on the Euclidean distance in this way leads to very large errors.
Instead, we seek a distance measure that would be insensitive to such translations, or 
indeed other known invariances, such as scale or rotation. 


Ideally, during classification we would like to first transform the patterns to be
as similar to one another as possible and only then compute their similarity, for instance by
the Euclidean distance. Alas, the computational complexity of such transformations
makes this ideal unattainable. Merely rotating a $k \times k$ image by a known amount and
interpolating to a new grid is $O(k^2)$. But of course we do not know the proper rotation
angle ahead of time and must search through several values, each value requiring a
distance calculation to test whether the optimal setting has been found. If we must
search for the optimal set of parameters for several transformations for each stored
prototype during classification, the computational burden is prohibitive (Problem 25).

The general approach in tangent distance classifiers is to use a novel measure of
distance and a linear approximation to the arbitrary transforms. Suppose we believe
there are $r$ transformations applicable to our problem, such as horizontal translation,
vertical translation, shear, rotation, scale, and line thinning. During construction of
the classifier we take each stored prototype $\mathbf{x}'$ and perform each of the transformations
$\mathcal{F}_i(\mathbf{x}'; \alpha_i)$ on it. Thus $\mathcal{F}_i(\mathbf{x}'; \alpha_i)$ could represent the image described by $\mathbf{x}'$, rotated
by a small angle $\alpha_i$. We then construct a tangent vector $\mathbf{TV}_i$ for each transformation:

$$\mathbf{TV}_i = \mathcal{F}_i(\mathbf{x}'; \alpha_i) - \mathbf{x}'. \tag{60}$$

While such a transformation may be compute intensive (as, for instance, the line
thinning transform), it need be done only once, during training when computational
constraints are lax. In this way we construct for each prototype $\mathbf{x}'$ a $d \times r$ matrix
$\mathbf{T}$ whose columns are the tangent vectors at $\mathbf{x}'$, so that $\mathbf{x}' + \mathbf{T}\mathbf{a}$ is a d-dimensional
vector for any r-dimensional coefficient vector $\mathbf{a}$. (Such vectors can be orthonormalized, but
we need assume here only that they are linearly independent.) It should be clear, too,
that this method will not work with binary images, since they lack a proper notion
of derivative. If the data are binary, then, it is traditional to blur the images before
creating a classifier based on tangent distance.

Each point in the subspace spanned by the r tangent vectors passing through 
x’ represents the linearized approximation to the full combination of transforms, as 
shown in Fig. 4.21. During classification we search for the point in the tangent space 
that is closest to a test point x — the linear approximation to our ideal. As we shall 
see, this search can be quite fast. 

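Now we turn to computing the tangent distance from a test point $\mathbf{x}$ to a particular
stored prototype $\mathbf{x}'$. Formally, given a matrix $\mathbf{T}$ consisting of the $r$ tangent vectors
at $\mathbf{x}'$, the tangent distance from $\mathbf{x}'$ to $\mathbf{x}$ is

$$D_{tan}(\mathbf{x}', \mathbf{x}) = \min_{\mathbf{a}} \big[ \| (\mathbf{x}' + \mathbf{T}\mathbf{a}) - \mathbf{x} \| \big], \tag{61}$$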


i.e., the Euclidean distance from x to the tangent space of x’. Equation 61 describes 
the so-called “one-sided” tangent distance, because only one pattern, x’, is trans- 
formed. The two-sided tangent distance allows both x and x’ to be transformed but 
improves the accuracy only slightly at a large added computational burden (Prob- 
lem 23); for this reason we shall concentrate on the one-sided version. 

During classification of $\mathbf{x}$ we will find its tangent distance to $\mathbf{x}'$ by finding the
optimizing value of $\mathbf{a}$ required by Eq. 61. This minimization is actually quite simple,
since the argument is a paraboloid as a function of $\mathbf{a}$, as shown in pink in Fig. 4.22.
We find the optimal $\mathbf{a}$ via iterative gradient descent. For gradient descent we need
the derivative of the (squared) Euclidean distance. The Euclidean distance in Eq. 61
obeys

$$D^2(\mathbf{x}' + \mathbf{T}\mathbf{a}, \mathbf{x}) = \| (\mathbf{x}' + \mathbf{T}\mathbf{a}) - \mathbf{x} \|^2, \tag{62}$$

and we compute the gradient with respect to the vector of parameters $\mathbf{a}$ (the
projections onto the tangent vectors) as

$$\nabla_{\mathbf{a}} D^2(\mathbf{x}' + \mathbf{T}\mathbf{a}, \mathbf{x}) = 2\mathbf{T}^t(\mathbf{x}' + \mathbf{T}\mathbf{a} - \mathbf{x}). \tag{63}$$

Thus we can start with an arbitrary $\mathbf{a}$ and take a step in the direction of the negative
gradient, updating our parameter vector as

$$\mathbf{a}(t + 1) = \mathbf{a}(t) - \eta\, \mathbf{T}^t(\mathbf{T}\mathbf{a}(t) + \mathbf{x}' - \mathbf{x}), \tag{64}$$

where $\eta$ is the scalar step size controlling the rate of convergence (the factor of 2
in Eq. 63 has been absorbed into $\eta$). So long as the step
is not too large, we will reduce the squared Euclidean distance. When the minimum
of such Euclidean distance is found, we have our tangent distance (Eq. 61). The
optimal $\mathbf{a}$ can also be found by standard matrix methods, but these generally have
higher computational complexities, as is explored in Problems 21 & 22. We note that
the methods for editing and prestructuring data sets described in Sec. 4.5.5 can be
applied to tangent distance classifiers too.


Figure 4.21: The pixel image of the handwritten 5 prototype at the lower left was
subjected to two transformations, rotation and line thinning, to obtain the tangent
vectors $\mathbf{TV}_1$ and $\mathbf{TV}_2$; images corresponding to these tangent vectors are shown outside
the axes. Each of the 16 images within the axes represents the prototype plus a
linear combination of the two tangent vectors with coefficients $a_1$ and $a_2$. The small
red number in each image is the Euclidean distance between the tangent approximation
and the image generated by the unapproximated transformations. Of course,
this Euclidean distance is 0 for the prototype and for the cases $a_1 = 1, a_2 = 0$ and
$a_1 = 0, a_2 = 1$. (The patterns generated with $a_1 + a_2 > 1$ have a gray background
because of automatic grayscale conversion of images with negative pixel values.)
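
The update of Eq. 64 can be sketched in a few lines of plain Python. The code below is an illustration under stated assumptions, not a reference implementation: the tangent vectors are passed as a list (the columns of T), the step size is set to the reciprocal of the trace of $\mathbf{T}^t\mathbf{T}$ (a conservative choice made here so that the iteration cannot diverge), and the function name, iteration cap and tolerance are all hypothetical.

```python
import math


def tangent_distance(x, x_proto, tangents, iterations=500, tol=1e-12):
    """One-sided tangent distance (Eq. 61) by the gradient descent of Eq. 64.

    tangents is a list of r tangent vectors, i.e. the columns of T.
    Returns (distance, coefficient vector a).
    """
    r = len(tangents)
    a = [0.0] * r
    # Step size: 1 / trace(T^t T) never exceeds 1 / (largest eigenvalue of T^t T).
    trace = sum(v * v for tv in tangents for v in tv)
    eta = 1.0 / trace if trace > 0.0 else 0.0

    def residual(coeffs):
        # (x' + T a) - x, computed coordinate by coordinate
        return [xp + sum(coeffs[i] * tangents[i][k] for i in range(r)) - xk
                for k, (xp, xk) in enumerate(zip(x_proto, x))]

    for _ in range(iterations):
        res = residual(a)
        # Gradient direction T^t (T a + x' - x): one dot product per tangent vector.
        grad = [sum(tv_k * res_k for tv_k, res_k in zip(tv, res)) for tv in tangents]
        if max((abs(g) for g in grad), default=0.0) < tol:
            break
        a = [a_i - eta * g for a_i, g in zip(a, grad)]
    return math.sqrt(sum(v * v for v in residual(a))), a


# Toy example: the tangent space at the origin is the plane of the first two axes.
x_proto = (0.0, 0.0, 0.0)
tangents = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
x = (3.0, 4.0, 5.0)
d_tan, a_opt = tangent_distance(x, x_proto, tangents)
print(d_tan, a_opt, math.dist(x, x_proto))
```

In this toy case only the component of x outside the tangent plane contributes to the tangent distance, which is therefore smaller than the plain Euclidean distance to the prototype.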


Nearest-neighbor classifiers using tangent distance have been shown to be highly
accurate, but they require the designer to know which invariances apply and to be able to
perform the corresponding transformations on each prototype. Some of the insights from the
tangent approach can also be used for learning which invariances underlie the training data,
a topic we shall revisit in Chap. ??.


Figure 4.22: A stored prototype $\mathbf{x}'$, if transformed by combinations of two basic
transformations, would fall somewhere on a complicated curved surface in the full
d-dimensional space (gray). The tangent space at $\mathbf{x}'$ is an r-dimensional Euclidean
space, spanned by the tangent vectors (here $\mathbf{TV}_1$ and $\mathbf{TV}_2$). The tangent distance
$D_{tan}(\mathbf{x}', \mathbf{x})$ is the smallest Euclidean distance from $\mathbf{x}$ to the tangent space of $\mathbf{x}'$, shown
in the solid red lines for two points, $\mathbf{x}_1$ and $\mathbf{x}_2$. Thus although the Euclidean distance
from $\mathbf{x}'$ to $\mathbf{x}_1$ is less than to $\mathbf{x}_2$, for the tangent distance the situation is reversed. The
Euclidean distance from $\mathbf{x}_2$ to the tangent space of $\mathbf{x}'$ is a quadratic function of the
parameter vector $\mathbf{a}$, as shown by the pink paraboloid. Thus simple gradient descent
methods can find the optimal vector $\mathbf{a}$ and hence the tangent distance $D_{tan}(\mathbf{x}', \mathbf{x}_2)$.


4.7 Fuzzy Classification 


Occasionally we may have informal knowledge about a problem domain where we
seek to build a classifier. For instance, we might feel, generally speaking, that an
adult salmon is oblong and light in color, while a sea bass is stouter and dark. The
approach taken in fuzzy classification is to create so-called “fuzzy category membership
functions,” which convert an objectively measurable parameter into subjective
“category memberships,” which are then used for classification. We must stress
immediately that the term “categories” used by fuzzy practitioners refers not to the final
class as we have been discussing, but instead just to overlapping ranges of feature values.
For instance, if we consider the feature value of lightness, fuzzy practitioners might
split this into five “categories”: dark, medium-dark, medium, medium-light and
light. In order to avoid misunderstandings, we shall use quotation marks when discussing
such “categories.”

For example we might have the lightness and shape of a fish be judged as in
Fig. 4.23. Next we need a way to convert an objective measurement in several features
into a category decision about the fish, and for this we need a merging or conjunction
rule: a way to take the “category memberships” (e.g., lightness and shape) and
yield a number to be used for making the final decision. Here fuzzy practitioners have
at their disposal a large number of possible functions. Indeed, most functions can be
used and there are few principled criteria to prefer one over another. One guiding
principle that is often invoked is that in the extreme cases, when the membership
functions have value 0 or 1, the conjunction reduces to standard predicate logic;
likewise, symmetry in the arguments is virtually always assumed. Nevertheless, there
are no strong principled reasons to impose these conditions, nor are they sufficient to
determine the “categories.”


Figure 4.23: “Category membership” functions, derived from the designer’s prior
knowledge, together with a conjunction rule lead to discriminants. In this figure $x$ might represent an
objectively measurable value such as the reflectivity of a fish’s skin. The designer
believes there are four relevant ranges, which might be called dark, medium-dark,
medium-light and light. Note, the memberships are not in true categories we wish to
classify, but instead merely ranges of feature values.

Suppose the designer feels that the final category based on lightness and shape can
be described as medium-light and oblong. While the heuristic category membership
function ($\mu(\cdot)$) converts the objective measurements to two “category memberships,”
we now need a conjunction rule to transform the component “membership values”
into a discriminant function. There are many ways to do this, but the most popular
is

$$1 - \min[\mu_x(x), \mu_y(y)], \tag{65}$$

and the obvious extension if there are more than two features.
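
As an illustration of how membership functions and Eq. 65 fit together, the sketch below builds two membership functions and combines them exactly as Eq. 65 is printed. Everything specific in it is an assumption made for the example: the triangular shape of the membership functions, their centers and widths, the feature scales, and the names `triangular`, `mu_medium_light`, `mu_oblong` and `eq65`.

```python
def triangular(center, half_width):
    """Return a membership function equal to 1 at center and falling
    linearly to 0 at center +/- half_width (a hypothetical shape)."""
    def mu(v):
        return max(0.0, 1.0 - abs(v - center) / half_width)
    return mu


# Hypothetical "category memberships" for two features of a fish.
mu_medium_light = triangular(center=0.6, half_width=0.2)  # lightness in [0, 1]
mu_oblong = triangular(center=3.0, half_width=1.0)        # length-to-width ratio


def eq65(lightness, shape):
    """Combine the two memberships exactly as Eq. 65 is printed."""
    return 1.0 - min(mu_medium_light(lightness), mu_oblong(shape))


for lightness, shape in [(0.6, 3.0), (0.7, 3.0), (0.6, 4.5)]:
    print(lightness, shape, round(eq65(lightness, shape), 3))
```

The `min` is symmetric in its arguments and, when both memberships are 0 or 1, behaves like a logical conjunction, which matches the guiding principles mentioned earlier.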

It must be emphasized that fuzzy techniques are completely and thoroughly sub- 
sumed by the general notion of discriminant function discussed in Chap. ?? (Prob- 
lem 29). 


4.7.1 Are Fuzzy Category Memberships just Probabilities? 


Even before the introduction of fuzzy methods and category membership functions, 
the statistics, pattern recognition and even mathematical philosophy communities ar- 
gued a great deal over the fundamental nature of probability. Some questioned the 
applicability of the concept to single, non-repeatable events, feeling that statements 
about a single event — what was the probability of rain on Tuesday? — were mean- 
ingless. Such discussion made it quite clear that “probability” need not apply only 
to repeatable events. Instead, since the first half of the 20th century, probability has 
been used as the logic of reasonable inference — work that highlighted the notion of 
subjective probability. Moreover, pattern recognition practitioners had happily used
discriminant functions without concern over whether they represented probabilities,
subjective probabilities, approximations to frequencies, or other fundamental entities.


Figure 4.24: “Category membership” functions and a conjunction rule based on the
designer’s prior knowledge lead to discriminant functions. Here $x_1$ and $x_2$ are objectively
measurable feature values. The designer believes that a particular class can be
described as the conjunction of two “category memberships,” here shown bold. Here
the conjunction rule of Eq. 65 is used to give the discriminant function. The resulting
discriminant function for the final category is indicated by the grayscale in the middle:
the greater the discriminant, the darker. The designer constructs discriminant functions
for other categories in a similar way (possibly also using disjunctions). During
classification, the maximum discriminant function is chosen.

While a full analysis of these topics would lead us away from our development of 
pattern recognition techniques, it pays to consider the claims of fuzzy logic proponents, 
since in order to be a good pattern recognition practitioner, we must understand what 
is or is not afforded by any technique. Proponents of fuzzy logic are adamant that 
category membership functions do not represent probabilities — subjective or not. 
Fuzzy practitioners point to examples such as when a half teaspoon of sugar is placed 
in a cup of tea, and conclude that the “membership” in the category sweet is 0.5, and 
that it would be incorrect to state that the probability the tea was sweet was 50%. 
But this situation can be viewed simply as one in which some sweetness feature value is 0.5, and there
is some discriminant function whose arguments include this feature value. One need
not entertain xxx 

Rather than debate the fundamental nature of probability, we should really be 
concerned with the nature of inference, i.e., how we take measurements and infer a 
category. Cox’s axioms (sometimes called the Cox-Jaynes axioms) are


1. If $P(a|d) > P(b|d)$ and $P(b|d) > P(c|d)$ then $P(a|d) > P(c|d)$. That is, degrees
of belief have a natural ordering, given by real numbers.


2. $P(\text{not } a|d) = F_1[P(a|d)]$. That is, the degree of belief that a proposition is not
the case is some function of the degree of belief that it is the case. Note that
such degrees of belief are graded values.


3. $P(a, b|d) = F_2[P(a|d), P(b|a, d)]$.


The first axiom states merely that the probability of not having proposition b 
given a is some function $F_1$ of the probability of b given a. The second, though not
as evident, is 

From these two, along with classical inference, we get the laws of probability. Any 
consistent inference method is formally equivalent to standard probabilistic inference. 

In spite of the arguments on such foundational issues, many practitioners are happy
to use fuzzy logic, feeling that “whatever works” should be part of their repertoire. It
is important, therefore, to understand the methodological strengths and limitations 
of the method. The limitations are formidable: 


• Fuzzy methods are of very limited use in high dimensions or on complex problems.
Pure fuzzy methods contribute little or nothing to problems with dozens
or hundreds of features, and where there is training data.


• The amount of information the designer can be expected to bring to a problem
is quite limited: the number, positions and widths of “category memberships.”


• Because of their lack of normalization, pure fuzzy methods are poorly suited to
problems in which there is a changing cost matrix $\lambda_{ij}$ (Computer exercise 9).


• Pure fuzzy methods do not make use of training data. When such pure fuzzy
methods (as outlined above) have unacceptable performance, it has been traditional
to try to graft on adaptive (e.g., “neuro-fuzzy”) methods.


If there is a contribution of fuzzy approaches to pattern recognition, it would lie 
in giving the steps by which one takes knowledge in a linguistic form and casts it 
into discriminant functions. It is unlikely that the verbal knowledge could extend to 
problems with dozens — much less hundreds — of features, the domain of the majority 
of real-world pattern recognition problems. A severe limitation of pure fuzzy methods
is that they do not rely on data, and when they give unsatisfactory results on problems of
moderate size, it has been traditional to try to use neural or other adaptive techniques to
compensate. At best, these are equivalent to maximum likelihood methods.


4.8 Relaxation methods 


We have seen how the Parzen-window method uses a fixed window throughout the 
feature space, and that this could lead to difficulties: in some regions a small window 
width was appropriate while elsewhere a large one would be best. The k-nearest- 
neighbor method addressed this problem by adjusting the region based on the density 
of the points. Informally speaking, an approach that is intermediate between these 
two is to adjust the size of the window during training according to the distance to the 
nearest point of a different category. This is the method of some relaxation techniques. 
(The term “relaxation” refers to the underlying mathematical techniques for setting
the parameters; we will not consider such relaxation issues, and concentrate instead
on their effects.)

The simplest method is that of potential functions, which merely consists of an
interpolation function. The difference with Parzen windows is that the magnitude
of each is adjusted so as to properly classify the training data. One representative
method, called the reduced Coulomb energy or RCE network, has the form shown
in Fig. 4.25, which has the same topology as a probabilistic neural network (Fig. 4.9).


[Figure 4.25: network diagram with layers labelled category, pattern and input.]


Figure 4.25: An RCE network is topologically equivalent to the PNN of Fig. 4.9. During
training the weights are adjusted to have the same values as the pattern presented,
just as in a PNN. However, pattern units in an RCE network also have a modifiable
“radius” parameter $\lambda$. During training, each $\lambda$ is adjusted so that the region is as
large as possible without containing training patterns from a different category.


The primary difference is that in an RCE network each pattern unit has an ad- 
justable parameter that corresponds to the radius of the d-dimensional sphere. During 
training, each radius is adjusted so that each pattern unit covers a region as large as 
possible without containing a training point from another category. 


Algorithm 4 (RCE training) 


1 begin initialize j = 0, n = #patterns, ε = small param, λ_m = max radius
2   do j ← j + 1
3     train weight: w_jk ← x_k
4     find nearest pt not in ω_i: x̂ ← arg min_{x ∉ ω_i} D(x, x′)
5     set radius: λ_j ← Min[D(x̂, x′) − ε, λ_m]
6     if x ∈ ω_i then a_ji ← 1
7   until j = n
8 end
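
The training procedure above can be sketched in Python. This is a minimal illustration under assumptions that are not in the text: the distance D is taken to be Euclidean, all training patterns are available at once (so each final radius is computed directly rather than shrunk as patterns arrive), and the names `rce_train`, `eps` and `lam_max` are hypothetical stand-ins for the procedure, ε and the maximum radius.

```python
import numpy as np

def rce_train(X, labels, eps=1e-3, lam_max=1.0):
    """Batch sketch of RCE training: one pattern unit per training pattern.

    Returns (prototypes, radii, labels). Assumes Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    radii = np.empty(len(X))
    for j in range(len(X)):
        # Patterns belonging to any category other than that of pattern j.
        other = X[labels != labels[j]]
        if len(other) == 0:
            # No competing category at all: use the maximum radius.
            radii[j] = lam_max
            continue
        nearest = np.min(np.linalg.norm(other - X[j], axis=1))
        # Largest radius that still excludes the nearest foreign pattern.
        radii[j] = min(nearest - eps, lam_max)
    # The "weights" of unit j are simply the stored pattern itself.
    return X.copy(), radii, labels.copy()
```

For three collinear points at 0, 1 and 3 with labels (0, 0, 1), `eps=0.1` and `lam_max=2.5`, the sketch gives the first unit the maximum radius and limits the other two by their nearest foreign neighbor. A very small or negative radius would correspond to the highly overlapping case discussed in the text, which this sketch does not handle specially.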


There are several subtleties that we need not consider right here. For instance, if 
the radius of a pattern unit becomes too small (i.e., less than some threshold $\lambda_{min}$),
then it indicates that different categories are highly overlapping. In that case, the 
pattern unit is called a “probabilistic” unit, and so marked. 

During classification, a test point is classified by presenting it to the pattern units and
noting which are activated; it takes the category label of the activated units. Any region
covered by units of more than one category is considered ambiguous. Such ambiguous regions can be useful, since
the teacher can be queried as to the identity of points in that region. If we continue
to let $\lambda_j$ be the radius around stored prototype $\mathbf{x}_j$ and now let $\mathcal{D}_t$ be the set of stored
prototypes in whose hyperspheres test point $\mathbf{x}$ lies, then our classification algorithm
is written as:


Algorithm 5 (RCE classification) 
1 begin initialize j = 0, k = 0, x = test pattern, D_t = {}
2   do j ← j + 1
3     if D(x, x_j) < λ_j then D_t ← D_t ∪ x_j
4   until j = n
5   if cat of all x_j ∈ D_t is the same then return label of all x_k ∈ D_t
6   else return “ambiguous” label
7 end
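
A minimal Python sketch of the classification step follows. It assumes Euclidean distance and that prototypes, radii and labels are already available as arrays; the function name `rce_classify` is hypothetical. One further assumption goes beyond the algorithm as written: a test point that falls inside no hypersphere is also reported as ambiguous.

```python
import numpy as np

def rce_classify(x, prototypes, radii, labels):
    """Return the common label of all units whose sphere contains x,
    or the string "ambiguous" if the labels disagree or no unit fires."""
    x = np.asarray(x, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - x, axis=1)
    # Units whose hypersphere strictly contains the test point.
    active = dists < np.asarray(radii, dtype=float)
    categories = set(np.asarray(labels)[active].tolist())
    if len(categories) == 1:
        return categories.pop()
    # Assumption: an empty set of active units is treated as ambiguous too.
    return "ambiguous"
```

With two prototypes at (0, 0) and (3, 0), both of radius 2, a point near the first prototype receives its label, while a point midway between them lies in both spheres and is flagged for the teacher.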


Figure 4.26: During training, each pattern has a parameter — equivalent to a radius 
in the d-dimensional space — that is adjusted to be as large as possible, without 
enclosing any points from a different category. As new patterns are presented, each 
such radius is decreased accordingly (and can never increase). In this way, each 
pattern unit can enclose several prototypes, but only those having the same category 
label. The number of points is shown in each component figure. The figure at the 
bottom shows the final complicated decision regions, colored by category. 


4.9 Approximations by Series Expansions


The nonparametric methods described thus far suffer from the requirement that in 
general all of the samples must be stored or that the designer have extensive knowledge 
of the problem. Since a large number of samples is needed to obtain good estimates, 
the memory requirements can be severe. In addition, considerable computation time 
may be required each time one of the methods is used to estimate $p(\mathbf{x})$ or classify a
new $\mathbf{x}$.

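In certain circumstances the Parzen-window procedure can be modified to reduce
these problems considerably. The basic idea is to approximate the window function
by a finite series expansion that is acceptably accurate in the region of interest. If
we are fortunate and can find two sets of functions $\psi_j(\mathbf{x})$ and $\chi_j(\mathbf{x})$ that allow the
expansion

$$\varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) = \sum_{j=1}^{m} a_j \psi_j(\mathbf{x}) \chi_j(\mathbf{x}_i), \tag{66}$$

then we can split the dependence upon $\mathbf{x}$ and $\mathbf{x}_i$ as

$$\sum_{i=1}^{n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) = \sum_{j=1}^{m} a_j \psi_j(\mathbf{x}) \sum_{i=1}^{n} \chi_j(\mathbf{x}_i). \tag{67}$$

Then from Eq. 11 we have

$$p_n(\mathbf{x}) = \sum_{j=1}^{m} b_j \psi_j(\mathbf{x}), \tag{68}$$

where

$$b_j = \frac{a_j}{n V_n} \sum_{i=1}^{n} \chi_j(\mathbf{x}_i). \tag{69}$$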


If a sufficiently accurate expansion can be obtained with a reasonable value for
$m$, this approach has some obvious advantages. The information in the $n$ samples
is reduced to the $m$ coefficients $b_j$. If additional samples are obtained, Eq. 69 for
$b_j$ can be updated easily, and the number of coefficients remains unchanged. If the
functions $\psi_j(\cdot)$ and $\chi_j(\cdot)$ are polynomial functions of the components of $\mathbf{x}$ and $\mathbf{x}_i$,
the expression for the estimate $p_n(\mathbf{x})$ is also a polynomial, which can be computed
relatively efficiently. Furthermore, use of this estimate for $p(\mathbf{x}|\omega_i)P(\omega_i)$ leads to a simple
way of obtaining polynomial discriminant functions.

Before becoming too enthusiastic, however, we should note one of the problems
with this approach. A key property of a useful window function is its tendency
to peak at the origin and fade away elsewhere. Thus $\varphi((\mathbf{x} - \mathbf{x}_i)/h_n)$ should peak
sharply at $\mathbf{x} = \mathbf{x}_i$, and contribute little to the approximation of $p_n(\mathbf{x})$ for $\mathbf{x}$ far from
$\mathbf{x}_i$. Unfortunately, polynomials have the annoying property of becoming unbounded.
Thus, in a polynomial expansion we might find the terms associated with an $\mathbf{x}_i$ far
from $\mathbf{x}$ contributing most (rather than least) to the expansion. It is quite important,
therefore, to be sure that the expansion of each window function is in fact accurate
in the region of interest, and this may well require a large number of terms.

There are many types of series expansions one might consider. Readers familiar 
with integral equations will naturally interpret Eq. 66 as an expansion of the kernel
$\varphi(\mathbf{x}, \mathbf{x}_i)$ in a series of eigenfunctions. (In analogy with eigenvectors and eigenvalues,
eigenfunctions are solutions to certain differential equations with fixed real-number 
coefficients.) Rather than computing eigenfunctions, one might choose any reasonable 
set of functions orthogonal over the region of interest and obtain a least-squares fit 
to the window function. We shall take an even more straightforward approach and 
expand the window function in a Taylor series. For simplicity, we confine our attention 
to a one-dimensional example using a Gaussian window function: 


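$$\sqrt{\pi}\,\varphi(u) = e^{-u^2} \simeq \sum_{j=0}^{m-1} (-1)^j \frac{u^{2j}}{j!}.$$

This expansion is most accurate near $u = 0$, and is in error by less than $u^{2m}/m!$. If
we substitute $u = (x - x_i)/h$, we obtain a polynomial of degree $2(m - 1)$ in $x$ and $x_i$.
For example, if $m = 2$ the window function can be approximated as

$$\sqrt{\pi}\,\varphi\left(\frac{x - x_i}{h}\right) \simeq 1 - \left(\frac{x - x_i}{h}\right)^2 = 1 + \frac{2}{h^2} x x_i - \frac{1}{h^2} x^2 - \frac{1}{h^2} x_i^2,$$

and thus

$$\sqrt{\pi}\,p_n(x) = \frac{\sqrt{\pi}}{nh} \sum_{i=1}^{n} \varphi\left(\frac{x - x_i}{h}\right) \simeq b_0 + b_1 x + b_2 x^2, \tag{70}$$

where the coefficients are

$$b_0 = \frac{1}{h} - \frac{1}{h^3} \frac{1}{n} \sum_{i=1}^{n} x_i^2, \qquad b_1 = \frac{2}{h^3} \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad b_2 = -\frac{1}{h^3}.$$
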
This simple expansion condenses the information in $n$ samples into the values
$b_0$, $b_1$, and $b_2$. It is accurate if the largest value of $|x - x_i|$ is not greater than $h$.
Unfortunately, this restricts us to a very wide window that is not capable of much
resolution. By taking more terms we can use a narrower window. If we let $r$ be the
largest value of $|x - x_i|$ and use the fact that the error in the $m$-term expansion of
$\sqrt{\pi}\,\varphi((x - x_i)/h)$ is less than $(r/h)^{2m}/m!$, then using Stirling’s approximation for $m!$
we find that the error in approximating $p_n(x)$ is less than

$$\frac{1}{\sqrt{\pi}\,h} \frac{(r/h)^{2m}}{m!} \simeq \frac{1}{\sqrt{\pi}\,h\sqrt{2\pi m}} \left[\frac{e}{m}\left(\frac{r}{h}\right)^2\right]^m. \tag{71}$$

Thus, the error becomes small only when $m > e(r/h)^2$. This implies the need for
many terms if the window size $h$ is small relative to the distance $r$ from $x$ to the most
distant sample. Although this example is rudimentary, similar considerations arise
in the multidimensional case even when more sophisticated expansions are used, and
the procedure is most attractive when the window size is relatively large.
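
As a numerical illustration of the one-dimensional example, the following Python sketch compares the exact Gaussian-window estimate with its two-term (m = 2) polynomial approximation. It assumes the window is $\varphi(u) = e^{-u^2}/\sqrt{\pi}$ and obtains the three coefficients by expanding $(1/nh)\sum_i [1 - ((x - x_i)/h)^2]$ in powers of $x$; the function names are hypothetical, and the bound divides the series truncation error $(r/h)^{2m}/m!$ by $\sqrt{\pi}\,h$ to account for the normalization of the estimate.

```python
import math
import numpy as np

def parzen_gaussian(x, samples, h):
    """Exact 1-D Parzen estimate with window exp(-u**2) / sqrt(pi)."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return np.mean(np.exp(-u ** 2)) / (h * math.sqrt(math.pi))

def quadratic_coefficients(samples, h):
    """Coefficients b0, b1, b2 of the m = 2 expansion of sqrt(pi) * p_n(x)."""
    s = np.asarray(samples, dtype=float)
    b0 = 1.0 / h - np.mean(s ** 2) / h ** 3
    b1 = 2.0 * np.mean(s) / h ** 3
    b2 = -1.0 / h ** 3
    return b0, b1, b2

def parzen_quadratic(x, coeffs):
    """Evaluate the polynomial approximation of p_n(x) from stored coefficients."""
    b0, b1, b2 = coeffs
    return (b0 + b1 * x + b2 * x ** 2) / math.sqrt(math.pi)

def truncation_bound(r, h, m):
    """Bound on the error in p_n(x) after m terms, with r = max |x - x_i|."""
    return (r / h) ** (2 * m) / (math.factorial(m) * math.sqrt(math.pi) * h)
```

Once the coefficients are computed the samples themselves are no longer needed, which is the storage saving the section describes. Shrinking `h` relative to the spread of the samples makes `truncation_bound` grow quickly, in line with the requirement that m exceed e(r/h)².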


4.10 Fisher Linear Discriminant 


One of the recurring problems encountered in applying statistical techniques to pat- 
tern recognition problems has been called the “curse of dimensionality.” Procedures 
that are analytically or computationally manageable in low-dimensional spaces can be- 
come completely impractical in a space of 50 or 100 dimensions. Pure fuzzy methods 
are particularly ill-suited to such high-dimensional problems since it is implausible 
that the designer’s linguistic intuition extends to such spaces. Thus, various tech- 
niques have been developed for reducing the dimensionality of the feature space in 
the hope of obtaining a more manageable problem. 

We can reduce the dimensionality from d dimensions to one dimension if we merely 
project the d-dimensional data onto a line. Of course, even if the samples formed 
well-separated, compact clusters in d-space, projection onto an arbitrary line will 
usually produce a confused mixture of samples from all of the classes, and thus poor 
recognition performance. However, by moving the line around, we might be able to 
find an orientation for which the projected samples are well separated. This is exactly 
the goal of classical discriminant analysis. 

Suppose that we have a set of $n$ $d$-dimensional samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$, $n_1$ in the subset
$\mathcal{D}_1$ labelled $\omega_1$ and $n_2$ in the subset $\mathcal{D}_2$ labelled $\omega_2$. If we form a linear combination
of the components of $\mathbf{x}$, we obtain the scalar dot product

$$y = \mathbf{w}^t\mathbf{x} \tag{72}$$

and a corresponding set of $n$ samples $y_1, \ldots, y_n$ divided into the subsets $\mathcal{Y}_1$ and $\mathcal{Y}_2$.
Geometrically, if $\|\mathbf{w}\| = 1$, each $y_i$ is the projection of the corresponding $\mathbf{x}_i$ onto a
line in the direction of $\mathbf{w}$. Actually, the magnitude of $\mathbf{w}$ is of no real significance,
since it merely scales $y$. The direction of $\mathbf{w}$ is important, however. If we imagine
that the samples labelled $\omega_1$ fall more or less into one cluster while those labelled $\omega_2$
fall in another, we want the projections falling onto the line to be well separated, not
thoroughly intermingled. Figure 4.27 illustrates the effect of choosing two different
values for $\mathbf{w}$ for a two-dimensional example. It should be abundantly clear that if the
original distributions are multimodal and highly overlapping, even the “best” $\mathbf{w}$ is
unlikely to provide adequate separation, and thus this method will be of little use.

Figure 4.27: Projection of samples onto two different lines. The figure on the right
shows greater separation between the red and black projected points.

We now turn to the matter of finding the best such direction $\mathbf{w}$, one we hope will
enable accurate classification. A measure of the separation between the projected
points is the difference of the sample means. If $\mathbf{m}_i$ is the $d$-dimensional sample mean
given by

$$\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{x}, \tag{73}$$

then the sample mean for the projected points is given by

$$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in \mathcal{Y}_i} y = \frac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{w}^t\mathbf{x} = \mathbf{w}^t\mathbf{m}_i \tag{74}$$

and is simply the projection of $\mathbf{m}_i$.
It follows that the distance between the projected means is

$$|\tilde{m}_1 - \tilde{m}_2| = |\mathbf{w}^t(\mathbf{m}_1 - \mathbf{m}_2)|, \tag{75}$$


and that we can make this difference as large as we wish merely by scaling w. Of 
course, to obtain good separation of the projected data we really want the difference 
between the means to be large relative to some measure of the standard deviations for 
each class. Rather than forming sample variances, we define the scatter for projected
samples labelled $\omega_i$ by

$$\tilde{s}_i^2 = \sum_{y \in \mathcal{Y}_i} (y - \tilde{m}_i)^2. \tag{76}$$

Thus, $(1/n)(\tilde{s}_1^2 + \tilde{s}_2^2)$ is an estimate of the variance of the pooled data, and $\tilde{s}_1^2 + \tilde{s}_2^2$
is called the total within-class scatter of the projected samples. The Fisher linear
discriminant employs that linear function $\mathbf{w}^t\mathbf{x}$ for which the criterion function

$$J(\mathbf{w}) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} \tag{77}$$

is maximum (and independent of $\|\mathbf{w}\|$). While the $\mathbf{w}$ maximizing $J(\cdot)$ leads to the
best separation between the two projected sets (in the sense just described), we will 
also need a threshold criterion before we have a true classifier. We first consider how 
to find the optimal w, and later turn to the issue of thresholds. 

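To obtain $J(\cdot)$ as an explicit function of $\mathbf{w}$, we define the scatter matrices $\mathbf{S}_i$ and
$\mathbf{S}_W$ by

$$\mathbf{S}_i = \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t \tag{78}$$

and

$$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2. \tag{79}$$

Then we can write

$$\tilde{s}_i^2 = \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{w}^t\mathbf{x} - \mathbf{w}^t\mathbf{m}_i)^2 = \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{w}^t(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t\mathbf{w} = \mathbf{w}^t\mathbf{S}_i\mathbf{w}; \tag{80}$$

therefore the sum of these scatters can be written

$$\tilde{s}_1^2 + \tilde{s}_2^2 = \mathbf{w}^t\mathbf{S}_W\mathbf{w}. \tag{81}$$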


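Similarly, the separation of the projected means obeys

$$(\tilde{m}_1 - \tilde{m}_2)^2 = (\mathbf{w}^t\mathbf{m}_1 - \mathbf{w}^t\mathbf{m}_2)^2 = \mathbf{w}^t(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^t\mathbf{w} = \mathbf{w}^t\mathbf{S}_B\mathbf{w}, \tag{82}$$

where

$$\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^t. \tag{83}$$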


We call $\mathbf{S}_W$ the within-class scatter matrix. It is proportional to the sample covariance
matrix for the pooled $d$-dimensional data. It is symmetric and positive
semidefinite, and is usually nonsingular if $n > d$. Likewise, $\mathbf{S}_B$ is called the between-class
scatter matrix. It is also symmetric and positive semidefinite, but because it is
the outer product of two vectors, its rank is at most one. In particular, for any $\mathbf{w}$,
$\mathbf{S}_B\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$, and $\mathbf{S}_B$ is quite singular.

In terms of $\mathbf{S}_B$ and $\mathbf{S}_W$, the criterion function $J(\cdot)$ can be written as

$$J(\mathbf{w}) = \frac{\mathbf{w}^t\mathbf{S}_B\mathbf{w}}{\mathbf{w}^t\mathbf{S}_W\mathbf{w}}. \tag{84}$$

This expression is well known in mathematical physics as the generalized Rayleigh
quotient. It is easy to show that a vector $\mathbf{w}$ that maximizes $J(\cdot)$ must satisfy

$$\mathbf{S}_B\mathbf{w} = \lambda\mathbf{S}_W\mathbf{w}, \tag{85}$$

for some constant $\lambda$, which is a generalized eigenvalue problem (Problem 36). This
can also be seen informally by noting that at an extremum of $J(\mathbf{w})$ a small change in
$\mathbf{w}$ in Eq. 84 should leave unchanged the ratio of the numerator to the denominator.
If $\mathbf{S}_W$ is nonsingular we can obtain a conventional eigenvalue problem by writing

$$\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w}. \tag{86}$$


In our particular case, it is unnecessary to solve for the eigenvalues and eigenvectors
of $\mathbf{S}_W^{-1}\mathbf{S}_B$ due to the fact that $\mathbf{S}_B\mathbf{w}$ is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$. Since the
scale factor for $\mathbf{w}$ is immaterial, we can immediately write the solution for the $\mathbf{w}$ that
optimizes $J(\cdot)$:


$$\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2). \tag{87}$$


Thus, we have obtained w for Fisher’s linear discriminant — the linear function 
yielding the maximum ratio of between-class scatter to within-class scatter. (The 
solution w given by Eq. 87 is sometimes called the canonical variate.) Thus the 
classification has been converted from a d-dimensional problem to a hopefully more 
manageable one-dimensional one. This mapping is many-to-one, and in theory can not 
possibly reduce the minimum achievable error rate if we have a very large training set. 
In general, one is willing to sacrifice some of the theoretically attainable performance 
for the advantages of working in one dimension. All that remains is to find the 
threshold, i.e., the point along the one-dimensional subspace separating the projected 
points. 
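
The two-class solution can be sketched in a few lines of NumPy. This is an illustration rather than a reference implementation: the function names are hypothetical, each class is assumed to be given as an array with one sample per row, and the within-class scatter matrix is assumed to be nonsingular so that the linear system can be solved.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction w proportional to S_W^{-1} (m1 - m2) for two classes."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solve S_W w = m1 - m2 rather than forming the inverse explicitly.
    return np.linalg.solve(S_W, m1 - m2)

def fisher_criterion(w, X1, X2):
    """Ratio of squared projected mean difference to total projected scatter."""
    y1 = np.asarray(X1, dtype=float) @ w
    y2 = np.asarray(X2, dtype=float) @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return between / within
```

Evaluating `fisher_criterion` at the returned direction and at arbitrary other directions gives a quick check that no other line separates the projected classes better in this sense, and that rescaling the direction leaves the criterion unchanged. A threshold along the projection is still needed to obtain a classifier.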

When the conditional densities $p(\mathbf{x}|\omega_i)$ are multivariate normal with equal covariance
matrices $\boldsymbol{\Sigma}$, we can calculate the threshold directly. In that case we recall
(Chap. ??, Sect. ??) that the optimal decision boundary has the equation

$$\mathbf{w}^t\mathbf{x} + w_0 = 0 \tag{88}$$

where

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \tag{89}$$

and where $w_0$ is a constant involving $\mathbf{w}$ and the prior probabilities. If we use sample
means and the sample covariance matrix to estimate $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}$, we obtain a vector
in the same direction as the $\mathbf{w}$ of Eq. 87 that maximized $J(\cdot)$. Thus, for the normal,
equal-covariance case, the optimal decision rule is merely to decide $\omega_1$ if Fisher’s linear
discriminant exceeds some threshold, and to decide $\omega_2$ otherwise. More generally, if
we smooth the projected data, or fit it with a univariate Gaussian, we then should
choose $w_0$ where the posteriors in the one dimensional distributions are equal.

The computational complexity of finding the optimal $\mathbf{w}$ for the Fisher linear discriminant
(Eq. 87) is dominated by the calculation of the within-category total scatter
and its inverse, an $O(d^2 n)$ calculation.


4.11 Multiple Discriminant Analysis 


For the $c$-class problem, the natural generalization of Fisher's linear discriminant
involves $c - 1$ discriminant functions. Thus, the projection is from a $d$-dimensional
space to a $(c - 1)$-dimensional space, and it is tacitly assumed that $d > c$. The
generalization for the within-class scatter matrix is obvious:

$$\mathbf{S}_W = \sum_{i=1}^{c} \mathbf{S}_i \tag{90}$$

where, as before,

$$\mathbf{S}_i = \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t \tag{91}$$

and

$$\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{x}. \tag{92}$$


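The proper generalization for $\mathbf{S}_B$ is not quite so obvious. Suppose that we define
a total mean vector $\mathbf{m}$ and a total scatter matrix $\mathbf{S}_T$ by

$$\mathbf{m} = \frac{1}{n} \sum_{\mathbf{x}} \mathbf{x} = \frac{1}{n} \sum_{i=1}^{c} n_i \mathbf{m}_i \tag{93}$$

and

$$\mathbf{S}_T = \sum_{\mathbf{x}} (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^t. \tag{94}$$

Then it follows that

$$\begin{aligned}
\mathbf{S}_T &= \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \mathbf{m}_i + \mathbf{m}_i - \mathbf{m})(\mathbf{x} - \mathbf{m}_i + \mathbf{m}_i - \mathbf{m})^t \\
&= \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^t + \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^t \\
&= \mathbf{S}_W + \sum_{i=1}^{c} n_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^t.
\end{aligned} \tag{95}$$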


It is natural to define this second term as a general between-class scatter matrix,
so that the total scatter is the sum of the within-class scatter and the between-class
scatter:

$$\mathbf{S}_B = \sum_{i=1}^{c} n_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^t \tag{96}$$

and

$$\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B. \tag{97}$$

If we check the two-class case, we find that the resulting between-class scatter matrix
is $n_1 n_2/n$ times our previous definition. *
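
The decomposition of the total scatter can be checked numerically. The NumPy sketch below is only an illustration; the function name `scatter_matrices` and the convention of one sample per row with a parallel array of labels are assumptions made here.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_W, S_B, S_T) for samples X (one per row) and class labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    m = X.mean(axis=0)                      # total mean vector
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                # class mean
        S_W += (Xc - mc).T @ (Xc - mc)      # within-class contribution
        diff = (mc - m)[:, None]
        S_B += len(Xc) * (diff @ diff.T)    # between-class contribution
    S_T = (X - m).T @ (X - m)               # total scatter about the total mean
    return S_W, S_B, S_T
```

For any labelled data set the returned matrices should satisfy the sum relation up to rounding, and with exactly two classes the between-class matrix should equal the outer product of the mean difference scaled by the product of the class sizes over the total sample count.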

The projection from a $d$-dimensional space to a $(c - 1)$-dimensional space is accomplished
by $c - 1$ discriminant functions

$$y_i = \mathbf{w}_i^t\mathbf{x}, \qquad i = 1, \ldots, c - 1. \tag{98}$$

If the $y_i$ are viewed as components of a vector $\mathbf{y}$ and the weight vectors $\mathbf{w}_i$ are viewed
as the columns of a $d$-by-$(c - 1)$ matrix $\mathbf{W}$, then the projection can be written as a
single matrix equation

$$\mathbf{y} = \mathbf{W}^t\mathbf{x}. \tag{99}$$


* We could redefine $\mathbf{S}_B$ for the two-class case to obtain complete consistency, but there should be
no misunderstanding of our usage.
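
The samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ project to a corresponding set of samples $\mathbf{y}_1, \ldots, \mathbf{y}_n$, which
can be described by their own mean vectors and scatter matrices. Thus, if we define

$$\tilde{\mathbf{m}}_i = \frac{1}{n_i} \sum_{\mathbf{y} \in \mathcal{Y}_i} \mathbf{y}, \tag{100}$$

$$\tilde{\mathbf{m}} = \frac{1}{n} \sum_{i=1}^{c} n_i \tilde{\mathbf{m}}_i, \tag{101}$$

$$\tilde{\mathbf{S}}_W = \sum_{i=1}^{c} \sum_{\mathbf{y} \in \mathcal{Y}_i} (\mathbf{y} - \tilde{\mathbf{m}}_i)(\mathbf{y} - \tilde{\mathbf{m}}_i)^t \tag{102}$$

and

$$\tilde{\mathbf{S}}_B = \sum_{i=1}^{c} n_i (\tilde{\mathbf{m}}_i - \tilde{\mathbf{m}})(\tilde{\mathbf{m}}_i - \tilde{\mathbf{m}})^t, \tag{103}$$

it is a straightforward matter to show that

$$\tilde{\mathbf{S}}_W = \mathbf{W}^t\mathbf{S}_W\mathbf{W} \tag{104}$$

and

$$\tilde{\mathbf{S}}_B = \mathbf{W}^t\mathbf{S}_B\mathbf{W}. \tag{105}$$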


These equations show how the within-class and between-class scatter matrices are 
transformed by the projection to the lower dimensional space (Fig. 4.28). What we 
seek is a transformation matrix W that in some sense maximizes the ratio of the 
between-class scatter to the within-class scatter. A simple scalar measure of scatter 
is the determinant of the scatter matrix. The determinant is the product of the 
eigenvalues, and hence is the product of the “variances” in the principal directions, 
thereby measuring the square of the hyperellipsoidal scattering volume. Using this 
measure, we obtain the criterion function 


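$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_B|}{|\tilde{\mathbf{S}}_W|} = \frac{|\mathbf{W}^t\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^t\mathbf{S}_W\mathbf{W}|}. \tag{106}$$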


The problem of finding a rectangular matrix $\mathbf{W}$ that maximizes $J(\cdot)$ is tricky,
though fortunately it turns out that the solution is relatively simple. The columns of
an optimal $\mathbf{W}$ are the generalized eigenvectors that correspond to the largest eigenvalues
in

$$\mathbf{S}_B\mathbf{w}_i = \lambda_i\mathbf{S}_W\mathbf{w}_i. \tag{107}$$


Figure 4.28: Three three-dimensional distributions are projected onto two-dimensional
subspaces, described by normal vectors $\mathbf{w}_1$ and $\mathbf{w}_2$. Informally, multiple discriminant
methods seek the optimum such subspace, i.e., the one with the greatest separation
of the projected distributions for a given total within-scatter matrix, here as
associated with $\mathbf{w}_1$.

A few observations about this solution are in order. First, if $\mathbf{S}_W$ is non-singular,
this can be converted to a conventional eigenvalue problem as before. However, this
is actually undesirable, since it requires an unnecessary computation of the inverse of
$\mathbf{S}_W$. Instead, one can find the eigenvalues as the roots of the characteristic polynomial

$$|\mathbf{S}_B - \lambda_i\mathbf{S}_W| = 0 \tag{108}$$

and then solve

$$(\mathbf{S}_B - \lambda_i\mathbf{S}_W)\mathbf{w}_i = 0 \tag{109}$$



directly for the eigenvectors $\mathbf{w}_i$. Because $\mathbf{S}_B$ is the sum of $c$ matrices of rank one or
less, and because only $c - 1$ of these are independent, $\mathbf{S}_B$ is of rank $c - 1$ or less. Thus,
no more than $c - 1$ of the eigenvalues are nonzero, and the desired weight vectors
correspond to these nonzero eigenvalues. If the within-class scatter is isotropic, the
eigenvectors are merely the eigenvectors of $\mathbf{S}_B$, and the eigenvectors with nonzero
eigenvalues span the space spanned by the vectors $\mathbf{m}_i - \mathbf{m}$. In this special case the
columns of $\mathbf{W}$ can be found simply by applying the Gram-Schmidt orthonormalization
procedure to the $c - 1$ vectors $\mathbf{m}_i - \mathbf{m}$, $i = 1, \ldots, c - 1$. Finally, we observe that in
general the solution for $\mathbf{W}$ is not unique. The allowable transformations include
rotating and scaling the axes in various ways. These are all linear transformations
from a $(c - 1)$-dimensional space to a $(c - 1)$-dimensional space, however, and do not
change things in any significant way; in particular, they leave the criterion function
$J(\mathbf{W})$ invariant and the classifier unchanged.
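
One way to compute such a projection numerically is sketched below in NumPy. Note that it does not follow the characteristic-polynomial route described in the text: as a swapped-in technique, it factors the within-class scatter matrix by Cholesky decomposition and solves an equivalent symmetric eigenproblem, which likewise avoids forming the inverse of the within-class scatter matrix. The function names are hypothetical, and the within-class scatter is assumed to be positive definite.

```python
import numpy as np

def class_scatter(X, labels):
    """Within-class and between-class scatter matrices (one sample per row)."""
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - m, mc - m)
    return S_W, S_B

def mda_projection(X, labels, k=None):
    """Columns of W solve S_B w = lambda S_W w for the k largest lambda."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    S_W, S_B = class_scatter(X, labels)
    if k is None:
        k = len(np.unique(labels)) - 1      # at most c - 1 nonzero eigenvalues
    L = np.linalg.cholesky(S_W)             # S_W = L L^t (assumes S_W > 0)
    M = np.linalg.solve(L, S_B)             # L^{-1} S_B
    A = np.linalg.solve(L, M.T)             # L^{-1} S_B L^{-t}, since S_B is symmetric
    A = 0.5 * (A + A.T)                     # remove rounding asymmetry
    evals, V = np.linalg.eigh(A)            # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]
    W = np.linalg.solve(L.T, V[:, order])   # map back: w = L^{-t} v
    return W, evals[order]
```

Projected samples are then obtained as `X @ W`. Because any nonsingular linear transformation of the projected space leaves the criterion unchanged, the particular scaling of the columns returned here is only one of many equivalent choices.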

If we have very little data, we would tend to project to a subspace of low dimen- 
sion, while if there is more data, we can use a higher dimension, as we shall explore 
in Chap. ??. Once we have projected the distributions onto the optimal subspace 
(defined as above), we can use the methods of Chapt. ?? to create our full classifier. 

As in the two-class case, multiple discriminant analysis primarily provides a reason- 
able way of reducing the dimensionality of the problem. Parametric or nonparametric 
techniques that might not have been feasible in the original space may work well in 
the lower-dimensional space. In particular, it may be possible to estimate separate 
covariance matrices for each class and use the general multivariate normal assump- 
tion after the transformation where this could not be done with the original data. In 
general, if the transformation causes some unnecessary overlapping of the data and 
increases the theoretically achievable error rate, then the problem of classifying the 
data still remains. However, there are other ways to reduce the dimensionality of
data, and we shall encounter this subject again in Chap. ??. We note that there are
also alternate methods of discriminant analysis, such as the selection of features
based on statistical significance, some of which are given in the references for this
chapter. Of these, Fisher’s method remains a fundamental and widely used technique.


Summary 


There are two overarching approaches to non-parametric estimation for pattern clas- 
sification: in one the densities are estimated (and then used for classification), in the 
other the category is chosen directly. The former approach is exemplified by Parzen 
windows and their hardware implementation, Probabilistic neural networks. The lat- 
ter is exemplified by k-nearest-neighbor and several forms of relaxation networks. In 
the limit of infinite training data, the nearest-neighbor error rate is bounded from
above by twice the Bayes error rate. The extremely high space complexity of the
nominal nearest-neighbor method can be reduced by editing (e.g., removing those
prototypes that are surrounded by prototypes of the same category), prestructuring
the data set for efficient search, or partial distance calculations. Novel distance measures,
such as the tangent distance, can be used in the nearest-neighbor algorithm for
incorporating known transformation invariances.

Fuzzy classification methods employ heuristic choices of “category membership”
and heuristic conjunction rules to obtain discriminant functions. Any benefit of such
techniques is limited to cases where there is very little (or no) training data, there are
few features, and the needed information can be gleaned from the designer’s prior
knowledge.

Relaxation methods such as potential functions create “basins of attraction” surrounding
training prototypes; when a test pattern lies in such a basin, the corresponding
prototype can be easily identified along with its category label. Reduced
Coulomb energy networks are one in the class of such relaxation networks; the basins
are adjusted to be as large as possible yet not include prototypes from other categories.

The Fisher linear discriminant finds a good subspace in which categories are best 
separated; other techniques can then be applied in the subspace. Fisher’s method 
can be extended to cases with multiple categories projected onto subspaces of higher 
dimension than a line. 


Bibliographical and Historical Remarks 


Parzen introduced his window method for estimating density functions [32], and its
use in regression was pioneered by Nadaraya and Watson [?, ?]. Its natural application
to classification problems stems from the work of Specht [39], including its PNN
hardware implementation [40].

Nearest-neighbor methods were first introduced by [16, 17], but it was over fifteen
years later that computer power had increased, thereby making them practical and
renewing interest in their theoretical foundations. Cover and Hart’s foundational work
on asymptotic bounds [10] was expanded somewhat through the analysis of Devroye
[14]. The first pruning or editing work in [23] was followed by a number of related
algorithms, such as those described in [5, 3]. The k-nearest neighbor was explored in [33].
The computational complexity of nearest neighbor (Voronoi) is described in [35]; work
on search, as described in [27], has proven to be of greater use, in general. Much of
the work on reducing the computational complexity of nearest-neighbor search comes
from the vector quantization and compression community; for instance, partial distance
calculations are described in [21]. Friedman has an excellent analysis of some of
the unintuitive properties of high dimensional spaces, and indirectly nearest neighbor
classifiers, an inspiration for several problems here [19]. The definitive collection of
seminal papers in nearest-neighbor classification is [12].

The notion of tangent distance was introduced by Simard and colleagues [38], and 
explored by a number of others [24]. Sperduti and Stork introduced a prestructuring 
and novel search criterion which speeds search in tangent based classifiers [41]. The 
greatest successes of tangent methods have been in optical character recognition, but 
the method can be applied in other domains, so long as the invariances are known. 
The study of general invariance has been most profitable when limited to a particular 
domain, and readers seeking further background should consult [31] for computer 
vision and [34] for speech. Background on image transformations is covered in [18]. 

The philosophical debate concerning frequency, probability, graded category mem- 
bership, and so on, has a long history [29]. Keynes espoused a theory of probability as 
the logic of probable inference, and did not need to rely on the notion of repeatability, 
frequency, etc. We subscribe to the traditional view that probability is a conceptual 
and formal relation between hypotheses and conclusions — here, specifically between 
data and category. The limiting cases of such rational belief are certainty (on the 
one hand), and impossibility (on the other). Classical theory of probability cannot be 
based solely on classical logic, which has no formal notions for the probability of an 
event. While the rules in Keynes’ probability [26] were taken as axiomatic, Cox [11]
and later Jaynes [?] sought to give them a formal underpinning.

Many years after these debates, “fuzzy” methods were proposed from the computer 
science community [43]. A formal equivalence of fuzzy category membership functions and 
probability is given in [22], which in turn is based on Cox [11]. Cheeseman has made 
some remarkably clear and forceful rebuttals to the assertions that fuzzy methods 
represent something beyond the notion of subjective probability [7, 8]; representative 
expositions to the contrary include [28, 4]. Readers unconcerned with foundational 
issues, and whether fuzzy methods provide any representational power or other ben- 
efits above standard probability (including subjective probability) can consult [25], 
which is loaded with over 3000 references. There are many connectives for fuzzy logic [2]. 

Early references on the use of potential functions for pattern classification are [1, 6]. 
This is closely allied with later work such as the RCE network described in [37, 36]. 

Fisher’s early work on linear discriminants [15] is well described in [30] and a 
number of standard textbooks [9, 13, 20, 30, 42]. 


Problems 


Section 4.3 
1. Show that Eqs. 19-22 are sufficient to assure convergence in Eqs. 17 & 18. 


2. Consider a normal $p(x) \sim N(\mu, \sigma^2)$ and Parzen-window function $\varphi(x) \sim N(0, 1)$. 
Show that the Parzen-window estimate 

$$p_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} \varphi\left(\frac{x - x_i}{h_n}\right)$$

has the following properties: 

(a) $\bar{p}_n(x) \sim N(\mu, \sigma^2 + h_n^2)$ 

(b) $\mathrm{Var}[p_n(x)] \simeq \frac{1}{2 n h_n \sqrt{\pi}}\, p(x)$ 

(c) $p(x) - \bar{p}_n(x) \simeq \frac{1}{2} \left(\frac{h_n}{\sigma}\right)^2 \left[1 - \left(\frac{x - \mu}{\sigma}\right)^2\right] p(x)$ 

for small $h_n$. (Note: if $h_n = h_1/\sqrt{n}$, this shows that the error due to bias goes to zero 
as $1/n$, whereas the standard deviation of the noise only goes to zero as $1/\sqrt[4]{n}$.) 
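
The Parzen-window estimate of Problem 2 can be sketched numerically as follows. This is a minimal illustration rather than a prescribed solution: the sample values, the window width `h` and the helper `integrate` are hypothetical choices made here, and the only claim checked is that the estimate integrates to one, as any density estimate built from a normalized window should.

```python
import math

def gaussian_window(u):
    """Standard normal window function, phi(u) ~ N(0, 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_estimate(x, samples, h):
    """Parzen-window estimate p_n(x) = (1 / (n h)) * sum_i phi((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_window((x - xi) / h) for xi in samples) / (n * h)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to sanity-check normalization."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

# Hypothetical one-dimensional samples and window width (assumptions).
samples = [-1.2, -0.3, 0.1, 0.4, 1.5]
h = 0.5

# The estimate should integrate to (very nearly) one.
area = integrate(lambda x: parzen_estimate(x, samples, h), -10.0, 10.0)
print(f"integral of p_n over [-10, 10]: {area:.6f}")
```

Repeating the computation with samples drawn from a normal density and several values of `h` gives an empirical feel for the bias and variance expressions in parts (b) and (c).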

3. Let $p(x) \sim U(0, a)$ be uniform from 0 to a, and let a Parzen window be defined 
as $\varphi(x) = e^{-x}$ for $x > 0$ and 0 for $x \le 0$. 


(a) Show that the mean of such a Parzen-window estimate is given by 

$$\bar{p}_n(x) = \begin{cases} 0 & x < 0 \\ \frac{1}{a}\left(1 - e^{-x/h_n}\right) & 0 \le x \le a \\ \frac{1}{a}\left(e^{a/h_n} - 1\right) e^{-x/h_n} & a \le x. \end{cases}$$


(b) Plot $\bar{p}_n(x)$ versus x for a = 1 and $h_n = 1$, $1/4$, and $1/16$. 


(c) How small does $h_n$ have to be to have less than one percent bias over 99 percent 
of the range $0 < x < a$? 


(d) Find $h_n$ for this condition if a = 1, and plot $\bar{p}_n(x)$ in the range $0 \le x \le 0.05$. 


4. Suppose in a c-category supervised learning environment we sample the full 
distribution p(x), and train a PNN classifier according to Algorithm ??. 


(a) Show that even if there are unequal category priors and hence unequal numbers 
of points in each category, the recognition method gives the right solution. 


(b) Suppose we have trained a PNN with the assumption of equal category priors, 
but later wish to use it for a problem having the cost matrix $\lambda_{ij}$, representing 
the cost of choosing category $\omega_i$ when in fact the pattern came from $\omega_j$. How 
should we do this? 


(c) Suppose instead we know a cost matrix $\lambda_{ij}$ before training. How shall we train 
a PNN for minimum risk? 

5. Show that Eq. 31 converges in probability to $p(x)$ given the conditions 
$\lim_{n \to \infty} k_n \to \infty$ and $\lim_{n \to \infty} k_n/n \to 0$. 

6. Let $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a set of n independent labelled samples and let $\mathcal{D}_k(\mathbf{x}) = 
\{\mathbf{x}'_1, \ldots, \mathbf{x}'_k\}$ be the k nearest neighbors of $\mathbf{x}$. Recall that the k-nearest-neighbor rule 
for classifying $\mathbf{x}$ is to give $\mathbf{x}$ the label most frequently represented in $\mathcal{D}_k(\mathbf{x})$. Consider 
a two-category problem with $P(\omega_1) = P(\omega_2) = 1/2$. Assume further that the conditional 
densities $p(\mathbf{x}|\omega_i)$ are uniform within unit hyperspheres a distance of ten units 
apart. 


(a) Show that if k is odd the average probability of error is given by 

$$P_n(e) = \frac{1}{2^n} \sum_{j=0}^{(k-1)/2} \binom{n}{j}.$$


(b) Show that for this case the single-nearest neighbor rule has a lower error rate 
than the k-nearest-neighbor error rate for k > 1. 


(c) If k is allowed to increase with n but is restricted by $k < a\sqrt{n}$, show that 
$P_n(e) \to 0$ as $n \to \infty$. 

7. Prove that the Voronoi cells induced by the single-nearest neighbor algorithm 
must always be convex. That is, for any two points $\mathbf{x}_1$ and $\mathbf{x}_2$ in a cell, all points on 
the line linking $\mathbf{x}_1$ and $\mathbf{x}_2$ must also lie in the cell. 

8. It is easy to see that the nearest-neighbor error rate P can equal the Bayes rate 
P* if P* = 0 (the best possibility) or if P* = (c— 1)/c (the worst possibility). One 
might ask whether or not there are problems for which P = P* when P* is between 
these extremes. 


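(a) Show that the Bayes rate for the one-dimensional case where $P(\omega_i) = 1/c$ and 

$$p(x|\omega_i) = \begin{cases} 1 & 0 \le x \le \frac{cr}{c-1} \\ 1 & i \le x \le i + 1 - \frac{cr}{c-1} \\ 0 & \text{elsewhere} \end{cases}$$

is $P^* = r$. 

(b) Show that for this case the nearest-neighbor rate is $P = P^*$. 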


9. Consider the following set of two-dimensional vectors: 


      ω1      |     ω2      |     ω3
   x1    x2   |  x1    x2   |  x1    x2
   10     0   |   5    10   |   2     8
    0   -10   |   0     5   |  -5     2
    5    -2   |   5     5   |  10    -4


(a) Plot the decision boundary resulting from the nearest-neighbor rule just for 
categorizing $\omega_1$ and $\omega_2$. Find the sample means $\mathbf{m}_1$ and $\mathbf{m}_2$ and on the same 
figure sketch the decision boundary corresponding to classifying $\mathbf{x}$ by assigning 
it to the category of the nearest sample mean. 


(b) Repeat part (a) for categorizing only $\omega_1$ and $\omega_3$. 


(c) Repeat part (a) for categorizing only $\omega_2$ and $\omega_3$. 


(d) Repeat part (a) for a three-category classifier, classifying $\omega_1$, $\omega_2$ and $\omega_3$. 


10. Prove that the computational complexity of the basic nearest-neighbor editing 
algorithm (Algorithm ??) for n points in d dimensions is $O(d^3 n^{\lfloor d/2 \rfloor} \ln n)$. 

11. To understand the “curse of dimensionality” in greater depth, consider the 
effects of high dimensions on the simple nearest-neighbor algorithm. Suppose we 
need to estimate a density function $f(\mathbf{x})$ in the unit hypercube in $\mathbb{R}^d$ based on n 
samples. If $f(\mathbf{x})$ is complicated, we need dense samples to learn it well. 


(a) Let $n_1$ denote the number of samples in a “dense” sample in $\mathbb{R}^1$. What is the 
sample size for the “same density” in $\mathbb{R}^d$? If $n_1 = 100$, what sample size is 
needed in a 20-dimensional space? 


(b) Show that the interpoint distances are all large and roughly equal in $\mathbb{R}^d$, and 
that neighborhoods that have even just a few points must have large radii. 


(c) Find $l_d(p)$, the length of a hypercube edge in d dimensions that contains the 
fraction p of points ($0 \le p \le 1$). To better appreciate the implications of your 
result, calculate: $l_5(0.01)$, $l_5(0.1)$, $l_{20}(0.01)$, and $l_{20}(0.1)$. 


(d) Show that nearly all points are close to an edge of the full space (e.g., the unit 
hypercube in d dimensions). Do this by calculating the $L_\infty$ distance from one 
point to the closest other point. This shows that nearly all points are closer to 
an edge than to another training point. (Argue that $L_\infty$ is more favorable than 
$L_2$ distance, even though it is easier to calculate here.) The result shows that 
most points are on or near the convex hull of training samples and that nearly 
every point is an “outlier” with respect to all the others. 


12. Show how the “curse of dimensionality” (Problem 11) can be “overcome” by 
choosing or assuming that your model is of a particular sort. Suppose that we are 
estimating a function of the form $y = f(\mathbf{x}) + N(0, \sigma^2)$. 


(a) Suppose the true function is linear, $f(\mathbf{x}) = \sum_{j=1}^{d} a_j x_j$, and that the approximation 
is $\hat{f}(\mathbf{x}) = \sum_{j=1}^{d} \hat{a}_j x_j$. Of course, the fit coefficients are: 

$$\hat{a}_j = \arg\min_{a_j} \sum_{i=1}^{n} \left[ y_i - \sum_{j=1}^{d} a_j x_{ij} \right]^2,$$

for $j = 1, \ldots, d$. Prove that $E[f(\mathbf{x}) - \hat{f}(\mathbf{x})]^2 = d\sigma^2/n$, i.e., that it increases 
linearly with d, and not exponentially as the curse of dimensionality might 
otherwise suggest. 

(b) Generalize your result from part (a) to the case where a function is expressed 
in a different basis set, i.e., $f(\mathbf{x}) = \sum_{i=1}^{d} a_i B_i(\mathbf{x})$ for some well-behaved basis set 
$B_i(\mathbf{x})$, and hence that the result does not depend on the fact that we have used 
a linear basis. 


13. Consider classifiers based on samples from the distributions 

$$p(x|\omega_1) = \begin{cases} 2x & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

and 

$$p(x|\omega_2) = \begin{cases} 2 - 2x & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$


(a) What is the Bayes decision rule and the Bayes classification error? 


(b) Suppose we randomly select a single point from $\omega_1$ and a single point from $\omega_2$, 
and create a nearest-neighbor classifier. Suppose too we select a test point from 
one of the categories ($\omega_1$ for definiteness). Integrate to find the expected error 
rate $P_1(e)$. 


(c) Repeat with two training samples from each category and a single test point in 
order to find $P_2(e)$. 


(d) Generalize to find the arbitrary $P_n(e)$. 


(e) Compare $\lim_{n \to \infty} P_n(e)$ with the Bayes error. 


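14. Repeat Problem 13 but with 

$$p(x|\omega_1) = \begin{cases} 3/2 & \text{for } 0 \le x \le 2/3 \\ 0 & \text{otherwise,} \end{cases}$$

and 

$$p(x|\omega_2) = \begin{cases} 3/2 & \text{for } 1/3 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$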


15. Expand in greater detail Algorithm 3 and add a conditional branch that will 
speed it. Assuming the data points come from c categories and there are, on average, 
k Voronoi neighbors of any point x, on average how much faster will your improved 
algorithm be? 

16. Consider the simple nearest-neighbor editing algorithm (Algorithm 3). 


(a) Show by counterexample that this algorithm does not yield the minimum set of 
points. (Hint: consider a problem where the points from each of two categories 
are constrained to be on the intersections of a two-dimensional Cartesian grid.) 


(b) Create a sequential editing algorithm, in which each point is considered in turn, 
and retained or rejected before the next point is considered. Prove that your 
algorithm does or does not depend upon the sequence in which the points are considered. 


17. Consider a classification problem where each of the c categories possesses the same 
distribution as well as prior $P(\omega_i) = 1/c$. Prove that the upper bound in Eq. 53, i.e., 

$$P \le P^* \left( 2 - \frac{c}{c-1} P^* \right),$$

is achieved in this “zero-information” case. 

18. Derive Eq. 55. 

19. Consider the Euclidean metric in d dimensions: 

$$D(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{k=1}^{d} (a_k - b_k)^2}.$$

Suppose we rescale each axis by a fixed factor, i.e., let $x'_k = \alpha_k x_k$ for real, non-zero 
constants $\alpha_k$, $k = 1, 2, \ldots, d$. Prove that the resulting space is a metric space. Discuss 
the import of this fact for standard nearest-neighbor classification methods. 

20. Prove that the Minkowski metric indeed possesses the four properties required 
of all metrics. 

21. Consider a non-iterative method for finding the tangent distance between x’ and 
x, given the matrix T consisting of the r (column) tangent vectors $\mathrm{TV}_i$ at x’. 


(a) As given in the text, take the gradient of the squared Euclidean distance in the 
a parameter space to find an equation that must be solved for the optimal a. 


(b) Solve your first derivative equation to find the optimizing a. 


(c) Compute the second derivative of $D^2(\cdot, \cdot)$ to prove that your solution must be 
a minimum squared distance, and not a maximum or inflection point. 


(d) If there are r tangent vectors (invariances) in a d-dimensional space, what is the 
computational complexity of your method? 


(e) In practice, the iterative method described in the text requires only a few 
(roughly 5) iterations for problems in handwritten OCR. Compare the com- 
plexities of your analytic solution to that of the iterative scheme. 


22. Consider a tangent-distance based classifier based on n prototypes, each rep- 
resenting a k x k pixel pattern of a handwritten character. Suppose there are r 
invariances we believe characterize the problem. What are the storage requirements 
(space complexity) of such a tangent-based classifier? 

23. The two-sided tangent distance allows both the stored prototype x’ and the test 
point x to be transformed. Thus if T is the matrix of the r tangent vectors for x’ and 
S likewise at x, the two-sided tangent distance is 


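$$D_{2tan}(\mathbf{x}', \mathbf{x}) = \min_{\mathbf{a}, \mathbf{b}} \left[ \left\| (\mathbf{x}' + \mathbf{T}\mathbf{a}) - (\mathbf{x} + \mathbf{S}\mathbf{b}) \right\| \right].$$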


(a) Follow the logic in Problem 21 and calculate the gradient with respect to the a 
parameter vector and to the b parameter vector. 


(b) What are the two update rules for an iterative scheme analogous to Eq. 64? 


(c) Prove that there is a unique minimum as a function of a and b. Describe this 
geometrically. 


(d) In an iterative scheme, we would alternately take steps in the a parameter 
space, then the b parameter space. What is the computational complexity of 
this approach to the two-sided tangent distance classifier? 


(e) Why is the actual complexity for classification in a two-sided tangent distance 
classifier even more severe than your result in (d) would suggest? 


24. Consider the two-sided tangent distance described in Problem 23. Suppose we 
restrict ourselves to n prototypes x’ in d dimensions, each with an associated matrix 
T of r tangent vectors, which we assume are linearly independent. Determine whether 
the two-sided tangent distance does or does not satisfy each of the requirements of a 
metric: non-negativity, reflexivity, symmetry and the triangle inequality. 

25. Consider the computational complexity of a nearest-neighbor classifier for k x 
k pixel grayscale images of handwritten digits. Instead of using tangent distance, 
we will search for the parameters of full nonlinear transforms before computing a 
Euclidean distance. Suppose the number of operations needed to perform each of our 
r transformations (e.g., rotation, line thinning, shear, and so forth) is $a_i k^2$, where 
for the sake of simplicity we assume $a_i \approx 10$. Suppose too that for the test of each 
prototype we must search through $A \approx 5$ such values, and judge it by the Euclidean 
distance. 


(a) Given a transformed image, how many operations are required to calculate the 
Euclidean distance to a stored prototype? 


(b) Find the number of operations required per search. 


(c) Suppose there are n prototypes. How many operations are required to find the 
nearest neighbor, given such transforms? 


(d) Assume for simplicity that no complexity reduction methods have been used 
(such as editing, partial distance, graph creation). If the number of prototypes 
is n = 10° points, and there are r = 6 transformations, and basic operations on 
our computer require 107? seconds, how long does it take to classify a single 
point? 


26. Explore the effect of r on the accuracy of nearest-neighbor search based on 
partial distance. Assume we have a large number n of points randomly placed in a 
d-dimensional hypercube. Suppose we have a test point x, also selected randomly 
in the hypercube, and find its nearest neighbor. By definition, if we use the full 
d-dimensional Euclidean distance, we are guaranteed to find its nearest neighbor. 
Suppose though we use the partial distance 

$$D_r(\mathbf{x}, \mathbf{x}') = \left( \sum_{i=1}^{r} (x_i - x'_i)^2 \right)^{1/2}.$$


(a) Plot the probability that a partial distance search finds the true closest neighbor 
of an arbitrary point x as a function of r for fixed n ($1 \le r \le d$) for d = 10. 


(b) Consider the effect of r on the accuracy of a nearest-neighbor classifier. Assume 
we have n/2 prototypes from each of two categories in a hypercube of length 1 
on a side. The density for each category is separable into the product of (linear) 
ramp functions, highest at one side, and zero at the other side of the range. 
Thus the density for category $\omega_1$ is highest at $(0, 0, \ldots, 0)^t$ and zero at $(1, 1, \ldots, 1)^t$, 
while the density for $\omega_2$ is highest at $(1, 1, \ldots, 1)^t$ and zero at $(0, 0, \ldots, 0)^t$. State 
by inspection the Bayesian decision boundary. 


(c) Calculate the Bayes error rate. 


(d) Calculate the probability of correct classification of a point x, randomly selected 
from one of the category densities, as a function of r in a partial distance metric. 


(e) If n = 10, what must r be for the partial distance nearest neighbor classifier to 
be within 1% of the Bayes rate? 




27. Consider the Tanimoto metric applied to sets having discrete elements. 


(a) Determine whether the four properties of a metric are obeyed by $D_{Tanimoto}(\cdot, \cdot)$ 
as given in Eq. 59. 


(b) Consider the following six words as mere sets of unordered letters: pattern, 
pat, pots, stop, taxonomy and elementary. Use the Tanimoto metric to rank 
order all $\binom{6}{2} = 15$ possible pairings of these sets. 


(c) Is the triangle inequality obeyed for these six patterns? 
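
As a starting point for Problem 27, the sketch below computes a Tanimoto distance between two words treated as sets of letters. It assumes that Eq. 59 has the usual form $D_{Tanimoto}(S_1, S_2) = (n_1 + n_2 - 2n_{12})/(n_1 + n_2 - n_{12})$, where $n_1$ and $n_2$ are the set sizes and $n_{12}$ is the size of the intersection; the function name and the convention for two empty sets are our own choices.

```python
from itertools import combinations

def tanimoto_distance(s1, s2):
    """Tanimoto distance between two collections, treated as sets.

    Assumed form: (n1 + n2 - 2*n12) / (n1 + n2 - n12).
    Two empty sets are taken to be at distance 0 (a convention chosen here).
    """
    a, b = set(s1), set(s2)
    n1, n2, n12 = len(a), len(b), len(a & b)
    if n1 + n2 - n12 == 0:
        return 0.0
    return (n1 + n2 - 2 * n12) / (n1 + n2 - n12)

words = ["pattern", "pat", "pots", "stop", "taxonomy", "elementary"]

# Every unordered pairing of the six words, ranked by increasing distance.
pairs = sorted(combinations(words, 2), key=lambda p: tanimoto_distance(*p))
for w1, w2 in pairs:
    print(f"{w1:>10s}  {w2:<10s}  {tanimoto_distance(w1, w2):.3f}")
```

Because repeated letters are discarded when a word becomes a set, anagrams such as "pots" and "stop" lie at distance zero.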

28. Suppose someone asks you whether a cup of water is hot or cold, and you respond 
that it is warm. Explain why this exchange in no way indicates that the membership 
of the cup in some “hot” class is a graded value less than 1.0. 

29. Consider the design of a fuzzy classifier for three types of fish based on two features: 
length and lightness. The designer feels that there are five ranges of length: short, 
medium-short, medium, medium-large and large. Similarly, lightness falls into three 
ranges: dark, medium and light. The designer uses the triangle function 

$$T(x; \mu_i, \delta_i) = \begin{cases} 1 - \frac{|x - \mu_i|}{\delta_i} & |x - \mu_i| \le \delta_i \\ 0 & \text{otherwise} \end{cases}$$

for the intermediate values, and an open triangle function for the extremes, i.e., 

$$C(x; \mu_i, \delta_i) = \begin{cases} 1 & x > \mu_i \\ 1 - \frac{\mu_i - x}{\delta_i} & \mu_i - \delta_i \le x \le \mu_i \\ 0 & \text{otherwise,} \end{cases}$$

and its symmetric version. 

Suppose we have for the length $\delta_i = 5$ and $\mu_1 = 5$, $\mu_2 = 7$, $\mu_3 = 9$, $\mu_4 = 11$ 
and $\mu_5 = 13$, and for lightness $\delta_i = 30$, $\mu_1 = 30$, $\mu_2 = 50$, and $\mu_3 = 70$. Suppose 
the designer feels that $\omega_1$ = medium-light and long, $\omega_2$ = dark and short and $\omega_3$ = 
medium dark and long, where the conjunction rule “and” is defined in Eq. 65. 


(a) Write the algebraic form of the discriminant functions. 


(b) If every “category membership function” were rescaled by a constant, would 
classification change? 


(c) Classify the pattern x = 7.5, 60. 


(d) Suppose that instead we knew that pattern is wy. Would we have any principled 
way to know whether the error was due to the number of category membership 
functions? their functional form? the conjunction rule? 
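
The membership functions of Problem 29 are simple to code. The sketch below is one reading of them, not a definitive one: it assumes the triangle function is nonzero exactly where $|x - \mu| \le \delta$, and that the "symmetric version" of the open triangle is its mirror image about $\mu$. The function names are hypothetical, and the conjunction rule of Eq. 65 is deliberately left out.

```python
def triangle(x, mu, delta):
    """Triangular membership: 1 at x = mu, falling linearly to 0 at mu +/- delta."""
    d = abs(x - mu)
    return 1.0 - d / delta if d <= delta else 0.0

def open_triangle_right(x, mu, delta):
    """Open triangle: 1 for x > mu, a linear ramp on [mu - delta, mu], 0 below."""
    if x > mu:
        return 1.0
    if x >= mu - delta:
        return 1.0 - (mu - x) / delta
    return 0.0

def open_triangle_left(x, mu, delta):
    """Mirror image of open_triangle_right about mu (assumed 'symmetric version')."""
    return open_triangle_right(2.0 * mu - x, mu, delta)

# Length memberships for x = 7.5 with the centers 5, 7, ..., 13 and delta = 5.
length = 7.5
memberships = {
    "short": open_triangle_left(length, 5.0, 5.0),
    "medium-short": triangle(length, 7.0, 5.0),
    "medium": triangle(length, 9.0, 5.0),
    "medium-large": triangle(length, 11.0, 5.0),
    "large": open_triangle_right(length, 13.0, 5.0),
}
for name, value in memberships.items():
    print(f"{name:>12s}: {value:.2f}")
```

Note that with these widths a single length value has nonzero membership in several ranges at once, which is worth keeping in mind for part (d).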

30. Suppose that through standard training of an RCE network (Algorithm 4), all 
the radii have been reduced to values less than $\lambda_m$. Prove that there is no subset of 
the training data that will yield the same category decision boundary. 

31. Consider a window function $\varphi(x) \sim N(0, 1)$ and a density estimate 

$$p_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} \varphi\left(\frac{x - x_i}{h_n}\right).$$

Approximate this estimate by factoring the window function and expanding the factor 
$e^{x x_i/h_n^2}$ in a Taylor series about the origin as follows: 


(a) Show that in terms of the normalized variable $u = x/h_n$ the m-term approximation 
is given by 

$$p_{nm}(x) = \frac{1}{\sqrt{2\pi}\, h_n} e^{-u^2/2} \sum_{j=0}^{m-1} b_j u^j$$

where 

$$b_j = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{j!} u_i^j e^{-u_i^2/2}.$$


(b) Suppose that the n samples happen to be extremely tightly clustered about 
$u = u_0$. Show that the two-term approximation peaks at the two points where 
$u^2 + u/u_0 - 1 = 0$. 


(c) Show that one peak occurs approximately at $u = u_0$, as desired, if $u_0 \ll 1$, but 
that it moves only to $u = 1$ for $u_0 \gg 1$. 


(d) Confirm your answer to part (c) by plotting $p_{n2}(u)$ versus u for $u_0 = 0.01$, 1, 
and 10. (Note: you may need to rescale the graphs vertically.) 
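
32. Let $p(\mathbf{x}|\omega_i)$ be arbitrary densities with means $\boldsymbol{\mu}_i$ and covariance matrices $\boldsymbol{\Sigma}_i$ 
(not necessarily normal) for i = 1, 2. Let $y = \mathbf{w}^t \mathbf{x}$ be a projection, and let the 
induced one-dimensional densities $p(y|\omega_i)$ have means $\mu_i$ and variances $\sigma_i^2$. 


(a) Show that the criterion function 

$$J_1(\mathbf{w}) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$

is maximized by 

$$\mathbf{w} = (\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2)^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$$


(b) If $P(\omega_i)$ is the prior probability for $\omega_i$, show that 

$$J_2(\mathbf{w}) = \frac{(\mu_1 - \mu_2)^2}{P(\omega_1)\sigma_1^2 + P(\omega_2)\sigma_2^2}$$

is maximized by 

$$\mathbf{w} = [P(\omega_1)\boldsymbol{\Sigma}_1 + P(\omega_2)\boldsymbol{\Sigma}_2]^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$$


(c) To which of these criterion functions is the $J(\mathbf{w})$ of Eq. ?? more closely related? 
Explain. 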


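33. The expression 

$$J_1 = \frac{1}{n_1 n_2} \sum_{y_i \in \mathcal{Y}_1} \sum_{y_j \in \mathcal{Y}_2} (y_i - y_j)^2$$

clearly measures the total within-group scatter. 


(a) Show that this within-group scatter can be written as 

$$J_1 = (m_1 - m_2)^2 + \frac{1}{n_1} s_1^2 + \frac{1}{n_2} s_2^2.$$

(b) Show that the total scatter is 

$$J_2 = \frac{1}{n_1} s_1^2 + \frac{1}{n_2} s_2^2.$$


(c) If $y = \mathbf{w}^t \mathbf{x}$, show that the $\mathbf{w}$ optimizing $J_1$ subject to the constraint that $J_2 = 1$ 
is given by 

$$\mathbf{w} = \lambda \left( \frac{1}{n_1} \mathbf{S}_1 + \frac{1}{n_2} \mathbf{S}_2 \right)^{-1} (\mathbf{m}_1 - \mathbf{m}_2),$$

where 

$$\lambda = \left[ (\mathbf{m}_1 - \mathbf{m}_2)^t \left( \frac{1}{n_1} \mathbf{S}_1 + \frac{1}{n_2} \mathbf{S}_2 \right)^{-1} (\mathbf{m}_1 - \mathbf{m}_2) \right]^{1/2},$$

$$\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{x},$$

and 

$$\mathbf{S}_i = \sum_{\mathbf{x} \in \mathcal{D}_i} n_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^t.$$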


34. If $\mathbf{S}_B$ and $\mathbf{S}_W$ are two real, symmetric, d-by-d matrices, it is well known that there 
exists a set of n eigenvalues $\lambda_1, \ldots, \lambda_n$ satisfying $|\mathbf{S}_B - \lambda \mathbf{S}_W| = 0$, and a corresponding 
set of n eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ satisfying $\mathbf{S}_B \mathbf{e}_i = \lambda_i \mathbf{S}_W \mathbf{e}_i$. Furthermore, if $\mathbf{S}_W$ is 
positive definite, the eigenvectors can always be normalized so that $\mathbf{e}_i^t \mathbf{S}_W \mathbf{e}_j = \delta_{ij}$ and 
$\mathbf{e}_i^t \mathbf{S}_B \mathbf{e}_j = \lambda_i \delta_{ij}$. Let $\tilde{\mathbf{S}}_W = \mathbf{W}^t \mathbf{S}_W \mathbf{W}$ and $\tilde{\mathbf{S}}_B = \mathbf{W}^t \mathbf{S}_B \mathbf{W}$, where $\mathbf{W}$ is a d-by-n 
matrix whose columns correspond to n distinct eigenvectors. 

(a) Show that $\tilde{\mathbf{S}}_W$ is the n-by-n identity matrix $\mathbf{I}$, and that $\tilde{\mathbf{S}}_B$ is a diagonal matrix 
whose elements are the corresponding eigenvalues. (This shows that the 
discriminant functions in multiple discriminant analysis are uncorrelated.) 


(b) What is the value of $J = |\tilde{\mathbf{S}}_B| / |\tilde{\mathbf{S}}_W|$? 


(c) Let $\mathbf{y} = \mathbf{W}^t \mathbf{x}$ be transformed by scaling the axes with a nonsingular n-by-n 
diagonal matrix $\mathbf{D}$ and by rotating this result with an orthogonal matrix $\mathbf{Q}$ 
where $\mathbf{y}' = \mathbf{Q}\mathbf{D}\mathbf{y}$. Show that J is invariant to this transformation. 


35. Consider two normal distributions with arbitrary but equal covariances. Prove 
that the Fisher linear discriminant, for suitable threshold, can be derived from the 
negative of the log-likelihood ratio. 

36. Consider the criterion function J(w) required for the Fisher linear discriminant. 


(a) Fill in the steps leading from Eqs. 77, 79 & 83 to Eq. 84. 


(b) Use matrix methods to show that the solution to Eq. 84 is indeed given by 
Eq. 85. 


(c) At the extreme of J(w), a small change in w must leave J(w) unchanged. 
Consider a small perturbation away from the optimal, w + Aw, and derive the 
solution condition of Eq. 85. 

37. Consider multidiscriminant versions of Fisher’s method for the case of c Gaussian 
distributions in d dimensions, each having the same covariance $\boldsymbol{\Sigma}$ (otherwise arbitrary) 
but different means. Solve for the optimal subspace in terms of $\boldsymbol{\Sigma}$ and the d mean 
vectors. 


Computer exercises 


Several exercises will make use of the following three-dimensional data sampled from 
three categories, denoted $\omega_i$. 


        |         ω1          |         ω2          |         ω3
 sample |   x1     x2     x3  |   x1     x2     x3  |   x1     x2     x3
    1   |  0.28   1.31  -6.2  |  0.011  1.03  -0.21 |  1.36   2.17   0.14
    2   |  0.07   0.58  -0.78 |  1.27   1.28   0.08 |  1.41   1.45  -0.38
    3   |  1.54   2.01  -1.63 |  0.13   3.12   0.16 |  1.22   0.99   0.69
    4   | -0.44   1.18  -4.32 | -0.21   1.23  -0.11 |  2.46   2.19   1.31
    5   | -0.81   0.21   5.73 | -2.18   1.39  -0.19 |  0.68   0.79   0.87
    6   |  1.52   3.16   2.77 |  0.34   1.96  -0.16 |  2.51   3.22   1.35
    7   |  2.20   2.42  -0.19 | -1.38   0.94   0.45 |  0.60   2.44   0.92
    8   |  0.91   1.94   6.21 | -0.12   0.82   0.17 |  0.64   0.13   0.97
    9   |  0.65   1.93   4.38 | -1.44   2.31   0.14 |  0.85   0.58   0.99
   10   | -0.26   0.82  -0.96 |  0.26   1.94   0.08 |  0.66   0.51   0.88

1. Explore some of the properties of density estimation in the following way. 

(a) Write a program to generate points according to a uniform distribution in a unit 
cube, $-1/2 \le x_i \le 1/2$ for i = 1, 2, 3. Generate $10^4$ such points. 


(b) Write a program to estimate the density at the origin based on your $10^4$ points as 
a function of the size of a cubical window function of size h. Plot your estimate 
as a function of h, for $0 < h \le 1$. 


(c) Evaluate the density at the origin using n of your points and the volume of a 
cube window which just encloses n points. Plot your estimate as a function of 
$n = 1, \ldots, 10^4$. 


(d) Write a program to generate $10^4$ points from a spherical Gaussian density (with 
$\boldsymbol{\Sigma} = \mathbf{I}$) centered on the origin. Repeat (b) & (c) with your Gaussian data. 


(e) Discuss any qualitative differences between the functional dependencies of your 
estimation results for the uniform and Gaussian densities. 
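
One possible starting point for parts (a) and (b) of Computer exercise 1 is sketched below. The helper names, the fixed random seed and the particular window sizes are our own assumptions; the estimate is simply the fraction of points falling inside a cube of side h centered on the origin, divided by the cube's volume.

```python
import random

def uniform_cube_points(n, d=3, seed=0):
    """Draw n points uniformly from the unit cube [-1/2, 1/2]^d."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]

def window_density_at_origin(points, h):
    """Cubical-window estimate at the origin: (k / n) / h^d,
    where k counts the points with every coordinate inside [-h/2, h/2]."""
    d = len(points[0])
    inside = sum(1 for p in points if all(abs(c) <= h / 2.0 for c in p))
    return inside / (len(points) * h ** d)

points = uniform_cube_points(10_000)

# Estimates for a few hypothetical window sizes; the true density is 1.
estimates = {h: window_density_at_origin(points, h) for h in (0.1, 0.25, 0.5, 1.0)}
for h, p_hat in estimates.items():
    print(f"h = {h:4.2f}   estimate = {p_hat:.3f}")
```

Small windows contain few points, so the estimate there is noisy; this is the behavior that part (b) asks you to plot and part (e) asks you to discuss.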

2. Consider Parzen-window estimates and classifiers for points in the table above. 
Let your window function be a spherical Gaussian, i.e., 

$$\varphi((\mathbf{x} - \mathbf{x}_i)/h) \propto \exp[-(\mathbf{x} - \mathbf{x}_i)^t (\mathbf{x} - \mathbf{x}_i)/(2h^2)].$$


(a) Write a program to classify an arbitrary test point $\mathbf{x}$ based on the Parzen window 
estimates. Train your classifier using the three-dimensional data from your three 
categories in the table above. Set h = 1 and classify the following three points: 
$(0.50, 1.0, 0.0)^t$, $(0.31, 1.51, -0.50)^t$ and $(-0.3, 0.44, -0.1)^t$. 


(b) Repeat with h = 0.1. 


Section 4.4 


3. Consider k-nearest-neighbor density estimations in different numbers of dimensions. 


(a) Write a program to find the k-nearest-neighbor density for n (unordered) points 
in one dimension. Use your program to plot such a density estimate for the $x_1$ 
values in category $\omega_3$ in the table above for k = 1, 3 and 5. 


(b) Write a program to find the k-nearest-neighbor density estimate for n points 
in two dimensions. Use your program to plot such a density estimate for the 
$x_1$–$x_2$ values in $\omega_2$ for k = 1, 3 and 5. 


(c) Write a program to form a k-nearest-neighbor classifier for the three-dimensional 
data from the three categories in the table above. Use your program with k = 
1, 3 and 5 to estimate the relative densities at the following points: $(-0.41, 0.82, 0.88)^t$, 
$(0.14, 0.72, 4.1)^t$ and $(-0.81, 0.61, -0.38)^t$. 
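
For part (a) of Computer exercise 3, a one-dimensional k-nearest-neighbor density estimate might be sketched as follows. We assume the usual form $p_n(x) = k/(nV)$, with the "volume" V taken as the length of the smallest interval centered on x that reaches the k-th nearest sample; the function name and the toy data are hypothetical.

```python
def knn_density_1d(x, samples, k):
    """k-nearest-neighbor density estimate in one dimension.

    p_n(x) = k / (n * V), where V = 2 * (distance from x to its k-th
    nearest sample).  If that distance is zero the estimate is unbounded,
    and float('inf') is returned.
    """
    distances = sorted(abs(x - xi) for xi in samples)
    volume = 2.0 * distances[k - 1]
    if volume == 0.0:
        return float("inf")
    return k / (len(samples) * volume)

# Hypothetical samples; substitute a column of the table above as needed.
samples = [0.0, 1.0, 2.0, 3.0]
for k in (1, 2, 4):
    print(f"k = {k}: p_n(1.5) = {knn_density_1d(1.5, samples, k):.4f}")
```

Evaluating the function on a fine grid of x values shows the characteristic spikes at the sample points for k = 1 and the smoother curves obtained for larger k.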
Section 4.5 


4. Write a program to create a Voronoi tessellation in two dimensions as follows. 


(a) First derive analytically the equation of a line separating two arbitrary points. 


(b) Given the full data set $\mathcal{D}$ of prototypes and a particular point $\mathbf{x} \in \mathcal{D}$, write a 
program to create a list of line segments comprising the Voronoi cell of $\mathbf{x}$. 


(c) Use your program to form the Voronoi tessellation of the $x_1$–$x_2$ features from 
the data of $\omega_1$ and $\omega_3$ in the table above. Plot your Voronoi diagram. 


(d) Write a program to find the category decision boundary based on this full set 
$\mathcal{D}$. 


(e) Implement a version of the pruning method described in Algorithm 3. Prune 
your data set from (c) to form a condensed set. 


(f) Apply your programs from (c) & (d) to form the Voronoi tessellation and boundary 
for your condensed data set. Compare the decision boundaries you found 
for the full and the condensed sets. 


5. Explore the tradeoff between computational complexity (as it relates to par- 
tial distance calculations) and search accuracy in nearest-neighbor classifiers in the 
following exercise. 


(a) Write a program to generate n prototypes from a uniform distribution in a 
6-dimensional hypercube centered on the origin. Use your program to generate 
10% points for category w1, 10% different points for category wa, and likewise for 
w3 and w4. Denote this full set D. 


(b) Use your program to generate a test set $\mathcal{D}_t$ of n = 100 points, also uniformly 
distributed in the 6-dimensional hypercube. 


(c) Write a program to implement the nearest-neighbor algorithm. Use 
this program to label each of your points in $\mathcal{D}_t$ by the category of its nearest 
neighbor in $\mathcal{D}$. From now on we will assume that the labels you find are in fact 
the true ones, and thus the “test error” is zero. 


(d) Write a program to perform nearest-neighbor classification using partial distance, 
based on just the first r features of each vector. Suppose we define the 
search accuracy as the percentage of points in $\mathcal{D}_t$ that are associated with their 
particular closest prototype in $\mathcal{D}$. (Thus for r = 6, this accuracy is 100%, by 
construction.) For $1 \le r \le 6$ in your partial distance classifier, estimate the 
search accuracy. Plot a curve of this search accuracy versus r. What value of r 
would give a 90% search accuracy? (Round r to the nearest integer.) 


(e) Estimate the “wall clock time” (the overall time required by your computer 
to perform the search) as a function of r. If T is the time for a full search 
in six dimensions, what value of r requires roughly T/2? What is the search 
accuracy in that case? 


(f) Suppose instead we define search accuracy as the classification accuracy. Estimate 
this classification accuracy for a partial distance nearest-neighbor classifier 
using your points of $\mathcal{D}_t$. Plot this accuracy for $1 \le r \le 6$. Explain your result. 


(g) Repeat (e) for this classification accuracy. If T is the time for full search in d 
dimensions, what value of r requires roughly T/2? What is the classification 
accuracy in this case? 
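
The partial-distance search and the "search accuracy" used in Computer exercise 5 can be sketched as below. This is only an outline under our own assumptions: the partial distance is taken to be the Euclidean distance restricted to the first r features, ties are broken by lowest index, and the function names and toy prototypes are hypothetical.

```python
import math

def partial_distance(a, b, r):
    """Euclidean distance computed from the first r features only."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a[:r], b[:r])))

def nearest_index(x, prototypes, r):
    """Index of the prototype closest to x under the r-feature partial distance."""
    return min(range(len(prototypes)),
               key=lambda i: partial_distance(x, prototypes[i], r))

def search_accuracy(test_points, prototypes, r):
    """Fraction of test points whose partial-distance nearest neighbor
    coincides with their full-distance nearest neighbor."""
    d = len(prototypes[0])
    hits = sum(nearest_index(x, prototypes, r) == nearest_index(x, prototypes, d)
               for x in test_points)
    return hits / len(test_points)

# Tiny hypothetical example: the first feature alone is misleading here.
prototypes = [[0.0, 0.0, 5.0], [1.0, 0.0, 0.0]]
test_points = [[0.2, 0.0, 0.0]]
for r in (1, 2, 3):
    print(f"r = {r}: search accuracy = {search_accuracy(test_points, prototypes, r):.2f}")
```

Wrapping the call to `search_accuracy` in a timer such as `time.perf_counter` gives the wall clock time asked for in the exercise.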

6. Consider nearest-neighbor classifiers employing different values of k in the $L_k$ 
norm or Minkowski metric. 


(a) Write a program to implement a nearest-neighbor classifier for c categories, using 
the Minkowski metric or $L_k$ norm, where k can be selected at classification time. 


(b) Use the three-dimensional data in the table above to classify the following points 
using the $L_k$ norm for k = 1, 2, 4 and $\infty$: $(2.21, 1.9, 0.43)^t$, $(-0.15, 1.17, 6.19)^t$ 
and $(0.01, 1.34, 2.60)^t$. 
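
A minimal sketch of the Minkowski ($L_k$) distance and a single-nearest-neighbor rule built on it is given below. The function names and the two toy prototypes are hypothetical; the $L_\infty$ case is handled separately as the largest coordinate difference.

```python
import math

def minkowski(a, b, k):
    """L_k (Minkowski) distance: (sum_i |a_i - b_i|^k)^(1/k).
    For k = infinity this is the largest coordinate difference."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if math.isinf(k):
        return max(diffs)
    return sum(d ** k for d in diffs) ** (1.0 / k)

def nearest_neighbor_label(x, prototypes, labels, k=2):
    """Label of the prototype closest to x in the L_k norm."""
    best = min(range(len(prototypes)),
               key=lambda i: minkowski(x, prototypes[i], k))
    return labels[best]

# Hypothetical prototypes; substitute the table above for the exercise.
prototypes = [[0.0, 0.0], [10.0, 10.0]]
labels = ["omega_1", "omega_2"]
for k in (1, 2, 4, math.inf):
    print(k, minkowski([0.0, 0.0], [3.0, 4.0], k),
          nearest_neighbor_label([1.0, 2.0], prototypes, labels, k))
```

Because k is an ordinary argument, the same classifier can be run with a different norm at classification time, as part (a) requires.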


7. Create a 10 x 10 pixel grayscale pattern x’ of a handwritten 4. 


(a) Plot the Euclidean distance between the 100-dimensional vectors corresponding 
to x’ and a horizontally shifted version of it as a function of the horizontal offset. 


(b) Shift x’ by two pixels to the right to form the tangent vector $\mathrm{TV}_1$. Write a 
program to calculate the tangent distance for shifted patterns using your $\mathrm{TV}_1$. 
Plot the tangent distance as a function of the displacement of the test pattern. 
Compare your graphs and explain the implications. 


8. Repeat Computer exercise 7 but for a handwritten 7, and vertical translations. 

9. Assume that size, color and shape are appropriate descriptions of fruit, and 
use fuzzy methods to classify fruit. In particular, assume all “category membership” 
functions are either triangular (with center $\mu$ and full half-width $\delta$) or, at the extremes, 
are left- or right-open triangular functions. 

Suppose the size features (measured in cm) are: small ($\mu = 2$), medium ($\mu = 
4$), large ($\mu = 6$), and extra-large ($\mu = 8$). In all cases we assume the category 
membership functions have $\delta = 3$. Suppose shape is described by the eccentricity, here 
the ratio of the major axis to minor axis lengths: thin ($\mu = 2, \delta = 0.6$), oblong 
($\mu = 1.6, \delta = 0.3$), oval ($\mu = 1.4, \delta = 0.2$) and spherical ($\mu = 1.1, \delta = 0.2$). Suppose 
color here is represented by some measure of the mixture of red to yellow: yellow 
($\mu = 0.1, \delta = 0.1$), yellow-orange ($\mu = 0.3, \delta = 0.3$), orange ($\mu = 0.5, \delta = 0.3$), orange-red 
($\mu = 0.7, \delta = 0.3$) and red ($\mu = 0.9, \delta = 0.3$). The fuzzy practitioner believes the 
following are good descriptions of some common fruit: 


• $\omega_1$ = cherry = {small and spherical and red} 

• $\omega_2$ = orange = {medium and spherical and orange} 

• $\omega_3$ = banana = {large and thin and yellow} 

• $\omega_4$ = peach = {medium and spherical and orange-red} 

• $\omega_5$ = plum = {medium and spherical and red} 

• $\omega_6$ = lemon = {medium and oblong and yellow} 

• $\omega_7$ = grapefruit = {medium and spherical and yellow} 


(a) Write a program to take any objective pattern and classify it. 

(b) Classify each of these {size, shape, color}: {2.5, 1.0, 0.95}, {7.5, 1.9, 0.2} and 
{5.0, 0.5, 0.4}. 


(c) Suppose there is a cost associated with classification, as described by a cost 
matrix $\lambda_{ij}$, the cost of selecting $\omega_i$ given that the true category is $\omega_j$. Suppose 
the cost matrix is 

Reclassify the patterns in (b) for minimum cost. 
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10. Explore relaxation networks in the following way. 


(a) Write a program to implement an RCE classifier in three dimensions. Let the 
starting radius be Am = 0.5. Train your classifier with the data from the three 
categories in the table above. For this data, how many times was any sphere 
reduced in size? (If the same sphere is reduced two times, count that as twice.) 


(b) Use your classifier to classify the following: (0.53, -0.44, 1.1)^t, (0.49, 0.44, 1.11)^t 
and (0.51, -0.21, 2.15)^t. If the classification of any point is ambiguous, state 
which are the candidate categories. 


Section 4.9 


11. Consider a classifier based on a Taylor series expansion of a Gaussian window 
function. Let k be the highest power of x_i in a Taylor series expansion of each of the 
independent features of a two-dimensional Gaussian. Below, consider just the x_1 - x_2 
features of categories ω2 and ω3 in the table above. 


(a) For each value k = 2, 4, and 6, classify the following three points: (0.56, 2.3, 0.10)^t, 
(0.60, 5.1, 0.86)^t and (-0.95, 1.3, 0.16)^t. 


Section 4.10 
12. Consider the Fisher linear discriminant method. 


(a) Write a general program to calculate the optimal direction w for a Fisher linear 
discriminant based on three-dimensional data. 


(b) Find the optimal w for categories ω2 and ω3 in the table above. 


(c) Plot a line representing your optimal direction w and mark on it the positions 
of the projected points. 
(d) In this subspace, fit each distribution with a (univariate) Gaussian, and find the 
resulting decision boundary. 


(e) What is the training error (the error on the training points themselves) in the 
optimal subspace you found in (b)? 


(f) For comparison, repeat (d) & (e) using instead the non-optimal direction w = 
(1.0, 2.0, -1.5)^t. What is the training error in this non-optimal subspace? 


Section 4.11 


13. Consider the multicategory generalization of the Fisher linear discriminant, 
applied to the data in the table above. 


(a) Write a general program to calculate the optimal w for multiple discriminant. 
Use your program to find the optimal two-dimensional plane (described by nor- 
mal vector w) for the three-dimensional data in the table. 


(b) In the subspace, fit a circularly symmetric Gaussian to the data, and use a 
simple linear classifier in each to find the decision boundaries in the subspace. 


(c) What is the error on the training set? 


(d) Classify the following points: (1.40, -0.36, -0.41)^t, (0.62, 1.30, 1.11)^t and (-0.11, 1.60, 1.51)^t. 


(e) For comparison, repeat (b) & (c) for the non-optimal direction w = (-0.5, -0.5, 1.0)^t. 
Explain the difference between your training errors in the two cases. 
Chapter 5 


Linear Discriminant Functions 


5.1 Introduction 


In Chap. ?? we assumed that the forms for the underlying probability densities were 
known, and used the training samples to estimate the values of their parameters. 
In this chapter we shall instead assume we know the proper forms for the discriminant 
functions, and use the samples to estimate the values of parameters of the classifier. 
We shall examine various procedures for determining discriminant functions, some of 
which are statistical and some of which are not. None of them, however, requires 
knowledge of the forms of underlying probability distributions, and in this limited 
sense they can be said to be nonparametric. 

Throughout this chapter we shall be concerned with discriminant functions that 
are either linear in the components of x, or linear in some given set of functions 
of x. Linear discriminant functions have a variety of pleasant analytical properties. 
As we have seen in Chap. ??, they can be optimal if the underlying distributions 
are cooperative, such as Gaussians having equal covariance, as might be obtained 
through an intelligent choice of feature detectors. Even when they are not optimal, 
we might be willing to sacrifice some performance in order to gain the advantage of 
their simplicity. Linear discriminant functions are relatively easy to compute and, in 
the absence of information suggesting otherwise, linear classifiers are attractive 
candidates for initial, trial classifiers. They also illustrate a number of very important 
principles which will be used more fully in neural networks (Chap. ??). 

The problem of finding a linear discriminant function will be formulated as a prob- 
lem of minimizing a criterion function. The obvious criterion function for classification 
purposes is the sample risk, or training error — the average loss incurred in classifying 
the set of training samples. We must emphasize right away, however, that despite the 
attractiveness of this criterion, it is fraught with problems. While our goal will be to 
classify novel test patterns, a small training error does not guarantee a small test error 
— a fascinating and subtle problem that will command our attention in Chap. ??. 
As we shall see here, it is difficult to derive the minimum-risk linear discriminant 
anyway, and for that reason we investigate several related criterion functions that are 
analytically more tractable. 

Much of our attention will be devoted to studying the convergence properties 
and computational complexities of various gradient descent procedures for minimizing 
criterion functions. The similarities between many of the procedures sometimes make 
it difficult to keep the differences between them clear, and for this reason we have 
included a summary of the principal results in Table 5.1 at the end of Sect. 5.10. 


5.2 Linear Discriminant Functions and Decision Surfaces 


5.2.1 The Two-Category Case 


A discriminant function that is a linear combination of the components of x can be 
written as 


$$ g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0, \tag{1} $$

where w is the weight vector and w_0 the bias or threshold weight. A two-category 
linear classifier implements the following decision rule: Decide ω1 if g(x) > 0 and ω2 
if g(x) < 0. Thus, x is assigned to ω1 if the inner product w^t x exceeds the threshold 
-w_0 and to ω2 otherwise. If g(x) = 0, x can ordinarily be assigned to either class, but 
in this chapter we shall leave the assignment undefined. Figure 5.1 shows a typical 
implementation, a clear example of the general structure of a pattern recognition 
system we saw in Chap. ??. 
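
As a minimal sketch of the decision rule built on Eq. 1, the following Python fragment evaluates g(x) and returns a category index. The function names `g` and `decide` and the numerical weights are illustrative assumptions; they do not come from the text.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0 (Eq. 1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def decide(x, w, w0):
    """Return 1 for omega_1, 2 for omega_2, None when g(x) = 0."""
    value = g(x, w, w0)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None  # the text leaves the assignment undefined on the boundary


# Hypothetical weights, chosen only for illustration.
w, w0 = [1.0, 2.0], -4.0
points = [[3.0, 2.0], [0.0, 1.0], [2.0, 1.0]]
labels = [decide(x, w, w0) for x in points]
```

The third point lies exactly on the hyperplane g(x) = 0, so the sketch returns `None` for it rather than forcing a category.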


Figure 5.1: A simple linear classifier having d input units, each corresponding to the 
values of the components of an input vector. Each input feature value x_i is multiplied 
by its corresponding weight w_i; the output unit sums all these products and emits a 
+1 if w^t x + w_0 > 0 or a -1 otherwise. 


The equation g(x) = 0 defines the decision surface that separates points assigned 
to ω1 from points assigned to ω2. When g(x) is linear, this decision surface is a 
hyperplane. If x_1 and x_2 are both on the decision surface, then 

$$ \mathbf{w}^t \mathbf{x}_1 + w_0 = \mathbf{w}^t \mathbf{x}_2 + w_0 $$

or 

$$ \mathbf{w}^t (\mathbf{x}_1 - \mathbf{x}_2) = 0, $$

and this shows that w is normal to any vector lying in the hyperplane. In general, 
the hyperplane H divides the feature space into two halfspaces, decision region R1 
for ω1 and region R2 for ω2. Since g(x) > 0 if x is in R1, it follows that the normal 
vector w points into R1. It is sometimes said that any x in R1 is on the positive side 
of H, and any x in R2 is on the negative side. 

The discriminant function g(x) gives an algebraic measure of the distance from x 
to the hyperplane. Perhaps the easiest way to see this is to express x as 


$$ \mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|}, $$

where x_p is the normal projection of x onto H, and r is the desired algebraic distance, 
positive if x is on the positive side and negative if x is on the negative side. Then, 
since g(x_p) = 0, 

$$ g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = r \|\mathbf{w}\|, $$

or 

$$ r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}. $$

In particular, the distance from the origin to H is given by w_0/||w||. If w_0 > 0 the 
origin is on the positive side of H, and if w_0 < 0 it is on the negative side. If w_0 = 0, 
then g(x) has the homogeneous form w^t x, and the hyperplane passes through the 
origin. A geometric illustration of these algebraic results is given in Fig. 5.2. 
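
The relation between g(x) and the signed distance can be checked numerically. The sketch below computes r = g(x)/||w|| and the normal projection x_p = x - r w/||w||; the helper names and the particular w, w_0 and x are assumptions made for illustration only.

```python
import math


def discriminant(x, w, w0):
    """g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def signed_distance(x, w, w0):
    """Algebraic distance r = g(x) / ||w|| from x to the hyperplane H."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return discriminant(x, w, w0) / norm_w


def project_onto_hyperplane(x, w, w0):
    """Normal projection x_p = x - r * w / ||w||, so that g(x_p) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    r = signed_distance(x, w, w0)
    return [xi - r * wi / norm_w for xi, wi in zip(x, w)]


# Hypothetical hyperplane and test point.
w, w0 = [3.0, 4.0], -5.0
x = [3.0, 4.0]

r = signed_distance(x, w, w0)                          # (9 + 16 - 5) / 5
x_p = project_onto_hyperplane(x, w, w0)                # lies on H
origin_distance = signed_distance([0.0, 0.0], w, w0)   # equals w0 / ||w||
```

Here w_0 < 0, so the origin has a negative signed distance and sits on the negative side of H, as stated above.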


Figure 5.2: The linear decision boundary H, where g(x) = w^t x + w_0 = 0, separates 
the feature space into two half-spaces R1 (where g(x) > 0) and R2 (where g(x) < 0). 


To summarize, a linear discriminant function divides the feature space by a hy- 
perplane decision surface. The orientation of the surface is determined by the normal 
vector w, and the location of the surface is determined by the bias w_0. The discriminant 
function g(x) is proportional to the signed distance from x to the hyperplane, 
with g(x) > 0 when x is on the positive side, and g(x) < 0 when x is on the negative 
side. 
5.2.2 The Multicategory Case 


There is more than one way to devise multicategory classifiers employing linear discriminant 
functions. For example, we might reduce the problem to c - 1 two-class 
problems, where the ith problem is solved by a linear discriminant function that 
separates points assigned to ω_i from those not assigned to ω_i. A more extravagant 
approach would be to use c(c - 1)/2 linear discriminants, one for every pair of classes. 
As illustrated in Fig. 5.3, both of these approaches can lead to regions in which the 
classification is undefined. We shall avoid this problem by adopting the approach 
taken in Chap. ??, defining c linear discriminant functions 

$$ g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \qquad i = 1, \ldots, c, \tag{2} $$

and assigning x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification 
is left undefined. The resulting classifier is called a linear machine. A linear machine 
divides the feature space into c decision regions, with g_i(x) being the largest discriminant 
if x is in region R_i. If R_i and R_j are contiguous, the boundary between them 
is a portion of the hyperplane H_ij defined by 

$$ g_i(\mathbf{x}) = g_j(\mathbf{x}) $$

or 

$$ (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0. $$

It follows at once that w_i - w_j is normal to H_ij, and the signed distance from x 
to H_ij is given by (g_i - g_j)/||w_i - w_j||. Thus, with the linear machine it is not the 
weight vectors themselves but their differences that are important. While there are 
c(c - 1)/2 pairs of regions, they need not all be contiguous, and the total number of 
hyperplane segments appearing in the decision surfaces is often fewer than c(c - 1)/2, 
as shown in Fig. 5.4. 

It is easy to show that the decision regions for a linear machine are convex and this 
restriction surely limits the flexibility and accuracy of the classifier (Problems 1 & 2). 
In particular, for good performance every decision region should be singly connected, 
and this tends to make the linear machine most suitable for problems for which the 
conditional densities p(x|ω_i) are unimodal. 
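
A linear machine as defined by Eq. 2 can be sketched in a few lines of Python: compute every g_i(x) and pick the largest, leaving ties undefined. The function name `linear_machine` and the three weight vectors are hypothetical, chosen only so the example is easy to follow by hand.

```python
def linear_machine(x, weights, biases):
    """Return the 1-based index i maximizing g_i(x) = w_i^t x + w_i0.

    Returns None when the maximum is shared by two or more categories,
    since the text leaves the classification undefined for ties.
    """
    scores = [
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] + 1 if len(winners) == 1 else None


# Hypothetical three-category machine in two dimensions.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

decisions = [
    linear_machine(x, weights, biases)
    for x in ([2.0, 1.0], [-1.0, -2.0], [1.0, 1.0])
]
```

The last point gives g_1 = g_2, so it lies on the boundary H_12 and no category is returned.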


5.3 Generalized Linear Discriminant Functions 


The linear discriminant function g(x) can be written as 


$$ g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i, \tag{3} $$

where the coefficients w_i are the components of the weight vector w. By adding 
additional terms involving the products of pairs of components of x, we obtain the 
quadratic discriminant function 

$$ g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j. \tag{4} $$
Figure 5.3: Linear decision boundaries for a four-class problem. The top figure shows 
ω_i/not ω_i dichotomies while the bottom figure shows ω_i/ω_j dichotomies. The pink 
regions have ambiguous category assignments. 


Since x_i x_j = x_j x_i, we can assume that w_ij = w_ji with no loss in generality. Thus, the 
quadratic discriminant function has an additional d(d+1)/2 coefficients at its disposal 
with which to produce more complicated separating surfaces. The separating surface 
defined by g(x) = 0 is a second-degree or hyperquadric surface. The linear terms 
in g(x) can be eliminated by translating the axes. We can define W = [w_ij], a 
symmetric, nonsingular matrix and then the basic character of the separating surface 
can be described in terms of the scaled matrix W̄ = W/(w^t W^{-1} w - 4 w_0). If W̄ 
is a positive multiple of the identity matrix, the separating surface is a hypersphere. 
If W̄ is positive definite, the separating surface is a hyperellipsoid. If some of the 
eigenvalues of W̄ are positive and others are negative, the surface is one of the variety 
of types of hyperhyperboloids (Problem 11). As we observed in Chap. ??, these are 
the kinds of separating surfaces that arise in the general multivariate Gaussian case. 

By continuing to add terms such as w_ijk x_i x_j x_k, we can obtain the class of polynomial 
discriminant functions. These can be thought of as truncated series expansions 
of some arbitrary g(x), and this in turn suggests the generalized linear discriminant 
function 

$$ g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \tag{5} $$
Figure 5.4: Decision boundaries produced by a linear machine for a three-class problem 
and a five-class problem. 


or 


$$ g(\mathbf{x}) = \mathbf{a}^t \mathbf{y}, \tag{6} $$


where a is now a d̂-dimensional weight vector, and where the d̂ functions y_i(x), sometimes 
called φ functions, can be arbitrary functions of x. Such functions might be 
computed by a feature detecting subsystem. By selecting these functions judiciously 
and letting d̂ be sufficiently large, one can approximate any desired discriminant function 
by such an expansion. The resulting discriminant function is not linear in x, but 
it is linear in y. The d̂ functions y_i(x) merely map points in d-dimensional x-space 
to points in d̂-dimensional y-space. The homogeneous discriminant a^t y separates 
points in this transformed space by a hyperplane passing through the origin. Thus, 
the mapping from x to y reduces the problem to one of finding a homogeneous linear 
discriminant function. 

Some of the advantages and disadvantages of this approach can be clarified by 
considering a simple example. Let the quadratic discriminant function be 


$$ g(x) = a_1 + a_2 x + a_3 x^2, \tag{7} $$


so that the three-dimensional vector y is given by 


$$ \mathbf{y} = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}. \tag{8} $$


The mapping from x to y is illustrated in Fig. 5.5. The data remain inherently one-dimensional, 
since varying x causes y to trace out a curve in three dimensions. Thus, 
one thing to notice immediately is that if x is governed by a probability law p(x), the 
induced density p̂(y) will be degenerate, being zero everywhere except on the curve, 
where it is infinite. This is a common problem whenever d̂ > d, and the mapping 
takes points from a lower-dimensional space to a higher-dimensional space. 

The plane Ĥ defined by a^t y = 0 divides the y-space into two decision regions R̂1 
and R̂2. Figure ?? shows the separating plane corresponding to a = (-1, 1, 2)^t, the 
decision regions R̂1 and R̂2, and their corresponding decision regions R1 and R2 in 
the original x-space. The quadratic discriminant function g(x) = -1 + x + 2x^2 is 
Figure 5.5: The mapping y = (1, x, x^2)^t takes a line and transforms it to a parabola 
in three dimensions. A plane splits the resulting y space into regions corresponding 
to two categories, and this in turn gives a non-simply connected decision region in the 
one-dimensional x space. 


positive if x < -1 or if x > 0.5, and thus R1 is multiply connected. Thus although 
the decision regions in y-space are convex, this is by no means the case in x-space. 
More generally speaking, even with relatively simple functions y_i(x), decision surfaces 
induced in an x-space can be fairly complex (Fig. 5.6). 
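
The one-dimensional example can be reproduced directly. The sketch below maps x to y = (1, x, x^2)^t, applies the weight vector a = (-1, 1, 2)^t from the text, and records the sign of a^t y at a few test values of x. The function names `phi` and `g` and the sample points are assumptions made for this illustration.

```python
def phi(x):
    """Map the scalar x to the three-dimensional vector y = (1, x, x^2)^t."""
    return [1.0, x, x * x]


a = [-1.0, 1.0, 2.0]  # weight vector used in the text's example


def g(x):
    """Homogeneous discriminant a^t y, equal to -1 + x + 2 x^2."""
    return sum(ai * yi for ai, yi in zip(a, phi(x)))


# Sample points on both sides of the roots x = -1 and x = 0.5.
xs = [-2.0, -1.5, -0.5, 0.0, 0.25, 1.0]
in_R1 = [g(x) > 0 for x in xs]
```

The points below -1 and above 0.5 fall in R1 while those in between fall in R2, so a single half-space in y-space corresponds to two disjoint intervals in x-space.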

Unfortunately, the curse of dimensionality often makes it hard to capitalize on 
this flexibility in practice. A complete quadratic discriminant function involves d̂ = 
(d + 1)(d + 2)/2 terms. If d is modestly large, say d = 50, this requires the computation 
of a great many terms; inclusion of cubic and higher orders leads to O(d^3) 
terms. Furthermore, the d̂ components of the weight vector a must be determined 
from training samples. If we think of d̂ as specifying the number of degrees of freedom 
for the discriminant function, it is natural to require that the number of samples be 
not less than the number of degrees of freedom (cf., Chap. ??). Clearly, a general 
series expansion of g(x) can easily lead to completely unrealistic requirements for 
computation and data. We shall see in Sect. ??, however, that this drawback can be 
accommodated by imposing a constraint of large margins, or bands between the training 
patterns. In this case, we are not, technically speaking, fitting all the free 
parameters; instead, we are relying on the assumption that the mapping to a high-dimensional 
space does not impose any spurious structure or relationships among the 
training points. Alternatively, multilayer neural networks approach this problem by 
employing multiple copies of a single nonlinear function of the input features, as we 
shall see in Chap. ??. 
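
To get a feel for how quickly the number of terms grows, one can count the distinct monomials of total degree at most k in d variables, which is the binomial coefficient C(d + k, k); for k = 2 this reduces to (d + 1)(d + 2)/2. The helper name `num_polynomial_terms` below is an assumption introduced for this sketch.

```python
from math import comb


def num_polynomial_terms(d, degree):
    """Number of distinct monomials of total degree <= `degree` in d variables."""
    return comb(d + degree, degree)


d = 50
quadratic_terms = num_polynomial_terms(d, 2)  # same as (d + 1)(d + 2) / 2
cubic_terms = num_polynomial_terms(d, 3)      # grows on the order of d^3
```

For d = 50 the complete quadratic already has 1326 coefficients, and allowing cubic terms raises the count to 23426.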

While it may be hard to realize the potential benefits of a generalized linear discriminant 
function, we can at least exploit the convenience of being able to write 
g(x) in the homogeneous form a^t y. In the particular case of the linear discriminant 
function 




Figure 5.6: The two-dimensional input space x is mapped through a polynomial 
function f to y. Here the mapping is y_1 = x_1, y_2 = x_2 and y_3 ∝ x_1 x_2. A linear 
discriminant in this transformed space is a hyperplane, which cuts the surface. Points 
to the positive side of the hyperplane Ĥ correspond to category ω1, and those beneath 
it ω2. Here, in terms of the x space, R1 is not simply connected. 


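$$ g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i, \tag{9} $$

where we set x_0 = 1. Thus we can write 

$$ \mathbf{y} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix}, \tag{10} $$

and y is sometimes called an augmented feature vector. Likewise, an augmented weight 
vector can be written as: 

$$ \mathbf{a} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} = \begin{pmatrix} w_0 \\ \mathbf{w} \end{pmatrix}. \tag{11} $$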


This mapping from d-dimensional x-space to (d+1)-dimensional y-space is mathematically 
trivial but nonetheless quite convenient. The addition of a constant component 
to x preserves all distance relationships among samples. The resulting y vectors 
all lie in a d-dimensional subspace, which is the x-space itself. The hyperplane decision 
surface Ĥ defined by a^t y = 0 passes through the origin in y-space, even though 
the corresponding hyperplane H can be in any position in x-space. The distance from 
y to Ĥ is given by |a^t y|/||a||, or |g(x)|/||a||. Since ||a|| ≥ ||w||, this distance is less 
than, or at most equal to, the distance from x to H. By using this mapping we reduce 
the problem of finding a weight vector w and a threshold weight w_0 to the problem 
of finding a single weight vector a (Fig. 5.7). 
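
The augmented form is easy to verify numerically. The following sketch builds y = (1, x)^t and a = (w_0, w)^t and compares the two distances discussed above; the helper names and the numerical values of w, w_0 and x are assumptions used only for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def augment(x):
    """Augmented feature vector y = (1, x_1, ..., x_d)^t."""
    return [1.0] + list(x)


# Hypothetical weight vector, bias and sample.
w, w0 = [3.0, 4.0], -5.0
x = [3.0, 4.0]

a = [w0] + w            # augmented weight vector (w_0, w_1, ..., w_d)^t
y = augment(x)

g_x = dot(w, x) + w0                     # g(x) = w^t x + w_0
g_y = dot(a, y)                          # the same value written as a^t y
dist_in_y = abs(g_y) / norm(a)           # distance from y to the plane a^t y = 0
dist_in_x = abs(g_x) / norm(w)           # distance from x to H
```

Because ||a||^2 = w_0^2 + ||w||^2, the distance measured in y-space can never exceed the one measured in x-space; the two agree only when w_0 = 0.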


Figure 5.7: A three-dimensional augmented feature space y and augmented weight 
vector a (at the origin). The set of points for which a^t y = 0 is a plane (or more 
generally, a hyperplane) perpendicular to a and passing through the origin of y- 
space, as indicated by the red disk. Such a plane need not pass through the origin of 
the two-dimensional x-space at the top, of course, as shown by the dashed line. Thus 
there exists an augmented weight vector a that will lead to any straight decision line 
in x-space. 


5.4 The Two-Category Linearly-Separable Case 


5.4.1 Geometry and Terminology 


Suppose now that we have a set of n samples y_1, ..., y_n, some labelled ω1 and some 
labelled ω2. We want to use these samples to determine the weights a in a linear 
discriminant function g(x) = a^t y. Suppose we have reason to believe that there 
exists a solution for which the probability of error is very low. Then a reasonable 
approach is to look for a weight vector that classifies all of the samples correctly. If 
such a weight vector exists, the samples are said to be linearly separable. 

A sample y_i is classified correctly if a^t y_i > 0 and y_i is labelled ω1 or if a^t y_i < 0 
and y_i is labelled ω2. This suggests a “normalization” that simplifies the treatment 
of the two-category case, viz., the replacement of all samples labelled ω2 by their 
negatives. With this “normalization” we can forget the labels and look for a weight 
vector a such that a^t y_i > 0 for all of the samples. Such a weight vector is called a 
separating vector or more generally a solution vector. 
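
The "normalization" trick can be sketched as follows: augment every sample, negate those labelled ω2, and then test the single condition a^t y_i > 0. The function names `normalize` and `is_solution` and the tiny one-dimensional data set are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def normalize(features, labels):
    """Augment each sample with a leading 1 and negate those labelled omega_2."""
    normalized = []
    for x, label in zip(features, labels):
        y = [1.0] + list(x)
        if label == 2:
            y = [-v for v in y]
        normalized.append(y)
    return normalized


def is_solution(a, normalized):
    """True if a^t y > 0 for every normalized sample, i.e. a is a solution vector."""
    return all(dot(a, y) > 0 for y in normalized)


# Hypothetical one-dimensional data: omega_1 at x = 2, 3 and omega_2 at x = -1, 0.
features = [[2.0], [3.0], [-1.0], [0.0]]
labels = [1, 1, 2, 2]
ys = normalize(features, labels)

separates = is_solution([-1.0, 1.0], ys)   # g(x) = x - 1
fails = is_solution([0.0, 1.0], ys)        # g(x) = x puts x = 0 on the boundary
```

After normalization the labels are no longer needed: checking a candidate weight vector reduces to checking the sign of n inner products.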

The weight vector a can be thought of as specifying a point in weight space. Each 
sample y_i places a constraint on the possible location of a solution vector. The 
equation a^t y_i = 0 defines a hyperplane through the origin of weight space having y_i 
as a normal vector. The solution vector, if it exists, must be on the positive side 
of every hyperplane. Thus, a solution vector must lie in the intersection of n half-spaces; 
indeed any vector in this region is a solution vector. The corresponding region 
is called the solution region, and should not be confused with the decision region in 
feature space corresponding to any particular category. A two-dimensional example 
illustrating the solution region for both the normalized and the unnormalized case is 
shown in Fig. 5.8. 


Figure 5.8: Four training samples (black for ω1, red for ω2) and the solution region 
in feature space. The figure on the left shows the raw data; the solution vectors lead 
to a plane that separates the patterns from the two categories. In the figure on the 
right, the red points have been “normalized”, i.e., changed in sign. Now the solution 
vector leads to a plane that places all “normalized” points on the same side. 


From this discussion, it should be clear that the solution vector (again, if it 
exists) is not unique. There are several ways to impose additional requirements to 
constrain the solution vector. One possibility is to seek a unit-length weight vector 
that maximizes the minimum distance from the samples to the separating plane. 
Another possibility is to seek the minimum-length weight vector satisfying a^t y_i ≥ b 
for all i, where b is a positive constant called the margin. As shown in Fig. 5.9, the 
solution region resulting from the intersections of the halfspaces for which a^t y_i ≥ b > 0 
lies within the previous solution region, being insulated from the old boundaries by 
the distance b/||y_i||. 

The motivation behind these attempts to find a solution vector closer to the “mid- 
dle” of the solution region is the natural belief that the resulting solution is more likely 
to classify new test samples correctly. In most of the cases we shall treat, however, 
we shall be satisfied with any solution strictly within the solution region. Our chief 
concern will be to see that any iterative procedure used does not converge to a limit 
point on the boundary. This problem can always be avoided by the introduction of a 
margin, i.e., by requiring that a^t y_i ≥ b > 0 for all i. 


5.4.2 Gradient Descent Procedures 


The approach we shall take to finding a solution to the set of linear inequalities 
a^t y_i > 0 will be to define a criterion function J(a) that is minimized if a is a solution 
vector. This reduces our problem to one of minimizing a scalar function, a problem 
that can often be solved by a gradient descent procedure. Basic gradient descent is 
very simple. We start with some arbitrarily chosen weight vector a(1) and compute 
the gradient vector ∇J(a(1)). The next value a(2) is obtained by moving some 
Figure 5.9: The effect of the margin on the solution region. At the left, the case of 
no margin (b = 0) equivalent to a case such as shown at the left in Fig. 5.8. At the 
right is the case b > 0, shrinking the solution region by margins b/||y_i||. 


distance from a(1) in the direction of steepest descent, i.e., along the negative of the 
gradient. In general, a(k + 1) is obtained from a(k) by the equation 


$$ \mathbf{a}(k+1) = \mathbf{a}(k) - \eta(k) \nabla J(\mathbf{a}(k)), \tag{12} $$


where η is a positive scale factor or learning rate that sets the step size. We hope 
that such a sequence of weight vectors will converge to a solution minimizing J(a). 
In algorithmic form we have: 


Algorithm 1 (Basic gradient descent) 


1 begin initialize a, criterion θ, η(·), k = 0 
2   do k ← k + 1 
3     a ← a - η(k)∇J(a) 
4   until ||η(k)∇J(a)|| < θ 
5   return a 
6 end 
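
Algorithm 1 translates almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the quadratic criterion J(a) = (a_1 - 1)^2 + 2(a_2 + 2)^2, the constant learning rate and the iteration cap `max_iter` are all illustrative choices, and the stopping test uses the norm of the step just taken.

```python
import math


def gradient_descent(grad, a, eta, theta=1e-8, max_iter=10_000):
    """Basic gradient descent: a <- a - eta(k) * grad(a) until the step is tiny.

    grad: function returning the gradient of J at a (as a list)
    eta:  function of the iteration index k returning the learning rate
    """
    for k in range(1, max_iter + 1):
        step = [eta(k) * g for g in grad(a)]
        a = [ai - si for ai, si in zip(a, step)]
        if math.sqrt(sum(s * s for s in step)) < theta:
            break
    return a


def grad_J(a):
    """Gradient of the hypothetical criterion J(a) = (a1 - 1)^2 + 2 (a2 + 2)^2."""
    return [2.0 * (a[0] - 1.0), 4.0 * (a[1] + 2.0)]


a_min = gradient_descent(grad_J, [0.0, 0.0], eta=lambda k: 0.1)
```

With this criterion the iterates approach the minimizer (1, -2); a much larger constant rate (for instance 0.6) would make the second component overshoot and diverge, which is the behavior the text warns about.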


The many problems associated with gradient descent procedures are well known. 
Fortunately, we shall be constructing the functions we want to minimize, and shall be 
able to avoid the most serious of these problems. One that will confront us repeatedly, 
however, is the choice of the learning rate η(k). If η(k) is too small, convergence is 
needlessly slow, whereas if η(k) is too large, the correction process will overshoot and 
can even diverge (Sect. 5.6.1). 

We now consider a principled method for setting the learning rate. Suppose that 
the criterion function can be well approximated by the second-order expansion around 
a value a(k) as 

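$$ J(\mathbf{a}) \simeq J(\mathbf{a}(k)) + \nabla J^t (\mathbf{a} - \mathbf{a}(k)) + \frac{1}{2} (\mathbf{a} - \mathbf{a}(k))^t \mathbf{H} (\mathbf{a} - \mathbf{a}(k)), \tag{13} $$

where H is the Hessian matrix of second partial derivatives ∂²J/∂a_i∂a_j evaluated at 
a(k). Then, substituting a(k + 1) from Eq. 12 into Eq. 13 we find: 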




$$ J(\mathbf{a}(k+1)) \simeq J(\mathbf{a}(k)) - \eta(k) \|\nabla J\|^2 + \frac{1}{2} \eta^2(k) \nabla J^t \mathbf{H} \nabla J. $$

From this it follows (Problem 12) that J(a(k + 1)) can be minimized by the choice 

$$ \eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H} \nabla J}, \tag{14} $$

where H depends on a, and thus indirectly on k. This then is the optimal choice 
of η(k) given the assumptions mentioned. Note that if the criterion function J(a) is 
quadratic throughout the region of interest, then H is constant and η is a constant 
independent of k. 
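
One way to see where this choice of learning rate comes from, assuming the quadratic form ∇J^t H ∇J is positive, is to treat the second-order approximation of J(a(k+1)) as a function of the scalar η and set its derivative to zero:

```latex
\begin{aligned}
f(\eta) &= J(\mathbf{a}(k)) - \eta\,\|\nabla J\|^2
          + \tfrac{1}{2}\,\eta^2\,\nabla J^t \mathbf{H}\,\nabla J, \\
f'(\eta) &= -\|\nabla J\|^2 + \eta\,\nabla J^t \mathbf{H}\,\nabla J = 0
\quad\Longrightarrow\quad
\eta = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H}\,\nabla J}.
\end{aligned}
```

Since f''(η) = ∇J^t H ∇J > 0 under the stated assumption, this stationary point is a minimum.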
An alternative approach, obtained by ignoring Eq. 12 and by choosing a(k + 1) 
to minimize the second-order expansion, is Newton’s algorithm where line 3 in 
Algorithm 1 is replaced by 

$$ \mathbf{a}(k+1) = \mathbf{a}(k) - \mathbf{H}^{-1} \nabla J, \tag{15} $$


leading to the following algorithm: 


Algorithm 2 (Newton descent) 


1 begin initialize a, criterion θ 
2   do 
3     a ← a - H^{-1}∇J(a) 
4   until ||H^{-1}∇J(a)|| < θ 
5   return a 
6 end 


Simple gradient descent and Newton’s algorithm are compared in Fig. 5.10. 

Generally speaking, Newton’s algorithm will usually give a greater improvement 
per step than the simple gradient descent algorithm, even with the optimal value 
of η(k). However, Newton’s algorithm is not applicable if the Hessian matrix H is 
singular. Furthermore, even when H is nonsingular, the O(d^3) time required for 
matrix inversion on each iteration can easily offset the descent advantage. In fact, 
it often takes less time to set η(k) to a constant η that is smaller than necessary 
and make a few more corrections than it is to compute the optimal η(k) at each step 
(Computer exercise 1). 
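
The comparison can be made concrete on a small quadratic criterion. The sketch below takes one gradient step with the optimal learning rate of Eq. 14 and one Newton step (Eq. 15) from the same starting point. The criterion J(a) = ½ a^t H a - b^t a, the 2-by-2 matrix H, the vector b, and all helper names are assumptions chosen so the arithmetic is easy to follow.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def mat_vec(H, v):
    return [dot(row, v) for row in H]


def solve_2x2(H, v):
    """Return H^{-1} v for a 2-by-2 matrix; fails if H is singular."""
    (h00, h01), (h10, h11) = H
    det = h00 * h11 - h01 * h10
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    return [(h11 * v[0] - h01 * v[1]) / det, (h00 * v[1] - h10 * v[0]) / det]


# Hypothetical quadratic criterion J(a) = 1/2 a^t H a - b^t a, minimized at (1, 1).
H = [[2.0, 0.0], [0.0, 10.0]]
b = [2.0, 10.0]


def J(a):
    return 0.5 * dot(a, mat_vec(H, a)) - dot(b, a)


def grad_J(a):
    return [ha - bi for ha, bi in zip(mat_vec(H, a), b)]


a0 = [0.0, 0.0]
g0 = grad_J(a0)

eta_opt = dot(g0, g0) / dot(g0, mat_vec(H, g0))               # Eq. 14
a_gradient = [ai - eta_opt * gi for ai, gi in zip(a0, g0)]    # Eq. 12
a_newton = [ai - si for ai, si in zip(a0, solve_2x2(H, g0))]  # Eq. 15
```

Because this J is exactly quadratic, the Newton step lands on the minimizer in one move, while the optimally scaled gradient step only reduces J part of the way. For larger d the explicit 2-by-2 inverse would be replaced by a general linear solve, which is where the O(d^3) cost enters.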


5.5 Minimizing the Perceptron Criterion Function 


5.5.1 The Perceptron Criterion Function 


Consider now the problem of constructing a criterion function for solving the linear 
inequalities a^t y_i > 0. The most obvious choice is to let J(a; y_1, ..., y_n) be the number 
of samples misclassified by a. However, because this function is piecewise constant, it 
is obviously a poor candidate for a gradient search. A better choice is the Perceptron 
criterion function 


$$ J_p(\mathbf{a}) = \sum_{\mathbf{y} \in \mathcal{Y}} (-\mathbf{a}^t \mathbf{y}), \tag{16} $$
Figure 5.10: The sequence of weight vectors given by a simple gradient descent method 
(red) and by Newton's (second order) algorithm (black). Newton's method typically 
leads to greater improvement per step, even when using optimal learning rates for both 
methods. However the added computational burden of inverting the Hessian matrix 
used in Newton's method is not always justified, and simple descent may suffice. 


where Y(a) is the set of samples misclassified by a. (If no samples are misclassified, 
Y is empty and we define J_p to be zero.) Since a^t y ≤ 0 if y is misclassified, J_p(a) 
is never negative, being zero only if a is a solution vector, or if a is on the decision 
boundary. Geometrically, J_p(a) is proportional to the sum of the distances from the 
misclassified samples to the decision boundary. Figure 5.11 illustrates J_p for a simple 
two-dimensional example. 

Since the jth component of the gradient of J_p is ∂J_p/∂a_j, we see from Eq. 16 that 

$$ \nabla J_p = \sum_{\mathbf{y} \in \mathcal{Y}} (-\mathbf{y}), \tag{17} $$

and hence the update rule becomes 

$$ \mathbf{a}(k+1) = \mathbf{a}(k) + \eta(k) \sum_{\mathbf{y} \in \mathcal{Y}_k} \mathbf{y}, \tag{18} $$

where Y_k is the set of samples misclassified by a(k). Thus the Perceptron algorithm 
is: 

Algorithm 3 (Batch Perceptron) 

1 begin initialize a, η(·), criterion θ, k = 0 
2   do k ← k + 1 
3     a ← a + η(k) Σ_{y ∈ Y_k} y 
4   until ||η(k) Σ_{y ∈ Y_k} y|| < θ 
5   return a 
6 end 
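
A compact Python rendering of the batch Perceptron update (Eq. 18) is sketched below. It assumes the samples have already been augmented and "normalized", treats a^t y ≤ 0 as a misclassification, and uses a constant learning rate; the function name, the iteration cap and the four-sample data set are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def batch_perceptron(samples, eta=1.0, theta=1e-12, max_iter=1000):
    """Batch Perceptron on normalized samples, starting from a = 0.

    Each iteration adds eta times the sum of all currently misclassified
    samples; iteration stops once that correction is (numerically) zero.
    """
    a = [0.0] * len(samples[0])
    for _ in range(max_iter):
        misclassified = [y for y in samples if dot(a, y) <= 0]
        step = [eta * sum(y[j] for y in misclassified) for j in range(len(a))]
        if math.sqrt(sum(s * s for s in step)) < theta:
            break
        a = [aj + sj for aj, sj in zip(a, step)]
    return a


# Hypothetical normalized samples (omega_2 samples already negated).
samples = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 0.0]]
a_batch = batch_perceptron(samples)
```

On this data the first correction adds all four samples, the second adds only the one sample still on the boundary, and the third finds nothing left to correct.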
Figure 5.11: Four learning criteria as a function of weights in a linear classifier. At the 
upper left is the total number of patterns misclassified, which is piecewise constant 
and hence unacceptable for gradient descent procedures. At the upper right is the 
Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient 
descent. The lower left is squared error (Eq. 32), which has nice analytic properties 
and is useful even when the patterns are not linearly separable. The lower right is 
the square error with margin (Eq. 33). A designer may adjust the margin b in order 
to force the solution vector to lie toward the middle of the b = 0 solution region in 
hopes of improving generalization of the resulting classifier. 


Thus, the batch Perceptron algorithm for finding a solution vector can be stated 
very simply: the next weight vector is obtained by adding some multiple of the sum 
of the misclassified samples to the present weight vector. We use the term “batch” 
to refer to the fact that (in general) a large group of samples is used when com- 
puting each weight update. (We shall soon see alternate methods based on single 
samples.) Figure 5.12 shows how this algorithm yields a solution vector for a simple 
two-dimensional example with a(1) = 0, and η(k) = 1. We shall now show that it 
will yield a solution for any linearly separable problem. 
Figure 5.12: The Perceptron criterion, J_p, is plotted as a function of the weights a_1 
and a_2 for a three-pattern problem. The weight vector begins at 0, and the algorithm 
sequentially adds to it vectors equal to the “normalized” misclassified patterns themselves. 
In the example shown, this sequence is y_2, y_3, y_1, y_3, at which time the vector 
lies in the solution region and iteration terminates. Note that the second update (by 
y_3) takes the candidate vector farther from the solution region than after the first 
update (cf. Theorem 5.1). (In an alternate, batch method, all the misclassified points 
are added at each iteration step leading to a smoother trajectory in weight space.) 


5.5.2 Convergence Proof for Single-Sample Correction 


We shall begin our examination of convergence properties of the Perceptron algorithm 
with a variant that is easier to analyze. Rather than testing a(k) on all of the 
samples and basing our correction on the set Y_k of misclassified training samples, we 
shall consider the samples in a sequence and shall modify the weight vector whenever 
it misclassifies a single sample. For the purposes of the convergence proof, the 
detailed nature of the sequence is unimportant as long as every sample appears in 
the sequence infinitely often. The simplest way to assure this is to repeat the samples 
cyclically, though from a practical point of view random selection is often to be 
preferred (Sec. 5.8.5). Clearly neither the batch nor this single-sample version of the 
Perceptron algorithm is on-line since we must store and potentially revisit all of the 
training patterns. 

Two further simplifications help to clarify the exposition. First, we shall temporarily 
restrict our attention to the case in which η(k) is constant, the so-called 
fixed-increment case. It is clear from Eq. 18 that if η(k) is constant it merely serves to 
scale the samples; thus, in the fixed-increment case we can take η(k) = 1 with no loss 
in generality. The second simplification merely involves notation. When the samples 
are considered sequentially, some will be misclassified. Since we shall only change the 
weight vector when there is an error, we really need only pay attention to the misclassified 
samples. Thus we shall denote the sequence of samples using superscripts, 
i.e., by y^1, y^2, ..., y^k, ..., where each y^k is one of the n samples y_1, ..., y_n, and where 
each y^k is misclassified. For example, if the samples y_1, y_2, and y_3 are considered 
cyclically, and if the marked samples 

$$ \overset{\downarrow}{\mathbf{y}_1}, \mathbf{y}_2, \overset{\downarrow}{\mathbf{y}_3}, \overset{\downarrow}{\mathbf{y}_1}, \overset{\downarrow}{\mathbf{y}_2}, \mathbf{y}_3, \mathbf{y}_1, \overset{\downarrow}{\mathbf{y}_2}, \ldots \tag{19} $$

are misclassified, then the sequence y^1, y^2, y^3, y^4, y^5, ... denotes the sequence 
y_1, y_3, y_1, y_2, y_2, ... With this understanding, the fixed-increment rule for generating 
a sequence of weight vectors can be written as 

$$ \begin{aligned} \mathbf{a}(1) & \quad \text{arbitrary} \\ \mathbf{a}(k+1) &= \mathbf{a}(k) + \mathbf{y}^k, \quad k \ge 1 \end{aligned} \tag{20} $$

where a^t(k) y^k ≤ 0 for all k. If we let n denote the total number of patterns, the 
algorithm is: 


Algorithm 4 (Fixed-increment single-sample Perceptron)

1 begin initialize a, k ← 0
2   do k ← (k + 1) mod n
3     if y_k is misclassified by a then a ← a + y_k
4   until all patterns properly classified
5   return a
6 end
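
The loop of Algorithm 4 can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names (`normalize`, `fixed_increment_perceptron`), the safety cap `max_corrections`, and the four sample points are assumptions introduced here, and a pattern is treated as misclassified when $a^t y \le 0$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def normalize(samples, labels):
    """Augment each x with a leading 1 and negate the patterns of class 2,
    so that a solution vector a satisfies a.y > 0 for every y."""
    ys = []
    for x, label in zip(samples, labels):
        y = [1] + list(x)
        ys.append(y if label == 1 else [-v for v in y])
    return ys


def fixed_increment_perceptron(ys, max_corrections=10_000):
    """Cycle through the normalized patterns, adding each misclassified
    pattern to the weight vector, until a full pass makes no correction."""
    n = len(ys)
    a = [0] * len(ys[0])      # a(1) = 0 here; any starting vector is allowed
    corrections = 0
    clean = 0                 # consecutive patterns classified correctly
    k = 0
    while clean < n:
        y = ys[k]
        if dot(a, y) <= 0:    # y_k is misclassified by a
            a = [ai + yi for ai, yi in zip(a, y)]
            corrections += 1
            clean = 0
            if corrections >= max_corrections:
                raise RuntimeError("no solution found; data may be nonseparable")
        else:
            clean += 1
        k = (k + 1) % n
    return a, corrections


# Hypothetical two-dimensional, linearly separable training set.
samples = [(1, 2), (2, 0), (3, 1), (2, 3)]
labels = [1, 1, 2, 2]
ys = normalize(samples, labels)
a, corrections = fixed_increment_perceptron(ys)
print("solution vector:", a, "after", corrections, "corrections")
```

The cap on the number of corrections is a practical guard only: on nonseparable data the loop would otherwise never end.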


The fixed-increment Perceptron rule is the simplest of many algorithms that have
been proposed for solving systems of linear inequalities. Geometrically, its interpretation
in weight space is particularly clear. Since a(k) misclassifies $y^k$, a(k) is not on
the positive side of the $y^k$ hyperplane $a^t y^k = 0$. The addition of $y^k$ to a(k) moves
the weight vector directly toward and perhaps across this hyperplane. Whether the
hyperplane is crossed or not, the new inner product $a^t(k+1)\,y^k$ is larger than the old
inner product $a^t(k)\,y^k$ by the amount $\|y^k\|^2$, and the correction is clearly moving the
weight vector in a good direction (Fig. 5.13).

Figure 5.13: Samples from two categories, $\omega_1$ (black) and $\omega_2$ (red), are shown in
augmented feature space, along with an augmented weight vector a. At each step
in a fixed-increment rule, one of the misclassified patterns, $y^k$, is shown by the large
dot. A correction $\Delta a$ (proportional to the pattern vector $y^k$) is added to the weight
vector, towards an $\omega_1$ point or away from an $\omega_2$ point. This changes the decision
boundary from the dashed position (from the previous update) to the solid position.
The sequence of resulting a vectors is shown, where later values are shown darker. In
this example, by step 9 a solution vector has been found and the categories successfully
separated by the decision boundary shown.


Clearly this algorithm can only terminate if the samples are linearly separable; we 
now prove that indeed it terminates so long as the samples are linearly separable. 


Theorem 5.1 (Perceptron Convergence) If training samples are linearly sepa- 
rable then the sequence of weight vectors given by Algorithm 4 will terminate at a 
solution vector. 


Proof: 


In seeking a proof, it is natural to try to show that each correction brings the weight
vector closer to the solution region. That is, one might try to show that if $\hat{a}$ is any
solution vector, then $\|a(k+1) - \hat{a}\|$ is smaller than $\|a(k) - \hat{a}\|$. While this turns out
not to be true in general (cf. steps 6 & 7 in Fig. 5.13), we shall see that it is true for
solution vectors that are sufficiently long.

Let $\hat{a}$ be any solution vector, so that $\hat{a}^t y_i$ is strictly positive for all i, and let $\alpha$ be
a positive scale factor. From Eq. 20,

$$a(k+1) - \alpha\hat{a} = (a(k) - \alpha\hat{a}) + y^k$$

and hence

$$\|a(k+1) - \alpha\hat{a}\|^2 = \|a(k) - \alpha\hat{a}\|^2 + 2\,(a(k) - \alpha\hat{a})^t y^k + \|y^k\|^2.$$

Since $y^k$ was misclassified, $a^t(k)\,y^k \le 0$, and thus

$$\|a(k+1) - \alpha\hat{a}\|^2 \le \|a(k) - \alpha\hat{a}\|^2 - 2\alpha\,\hat{a}^t y^k + \|y^k\|^2.$$

Because $\hat{a}^t y^k$ is strictly positive, the second term will dominate the third if $\alpha$ is
sufficiently large. In particular, if we let $\beta$ be the maximum length of a pattern
vector,

$$\beta^2 = \max_i \|y_i\|^2, \tag{21}$$

and $\gamma$ be the smallest inner product of the solution vector with any pattern vector,
i.e.,

$$\gamma = \min_i \left[\hat{a}^t y_i\right] > 0, \tag{22}$$

then we have the inequality

$$\|a(k+1) - \alpha\hat{a}\|^2 \le \|a(k) - \alpha\hat{a}\|^2 - 2\alpha\gamma + \beta^2.$$

If we choose

$$\alpha = \frac{\beta^2}{\gamma}, \tag{23}$$

we obtain

$$\|a(k+1) - \alpha\hat{a}\|^2 \le \|a(k) - \alpha\hat{a}\|^2 - \beta^2.$$

Thus, the squared distance from a(k) to $\alpha\hat{a}$ is reduced by at least $\beta^2$ at each correction,
and after k corrections

$$\|a(k+1) - \alpha\hat{a}\|^2 \le \|a(1) - \alpha\hat{a}\|^2 - k\beta^2. \tag{24}$$

Since the squared distance cannot become negative, it follows that the sequence of
corrections must terminate after no more than $k_0$ corrections, where

$$k_0 = \frac{\|a(1) - \alpha\hat{a}\|^2}{\beta^2}. \tag{25}$$

Since a correction occurs whenever a sample is misclassified, and since each sample
appears infinitely often in the sequence, it follows that when corrections cease the
resulting weight vector must classify all of the samples correctly. ∎


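The number $k_0$ gives us a bound on the number of corrections. If a(1) = 0, we
get the following particularly simple expression for $k_0$:

$$k_0 = \frac{\alpha^2\|\hat{a}\|^2}{\beta^2} = \frac{\beta^2\|\hat{a}\|^2}{\gamma^2} = \frac{\max_i \|y_i\|^2\,\|\hat{a}\|^2}{\min_i \left[y_i^t\hat{a}\right]^2}. \tag{26}$$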


The denominator in Eq. 26 shows that the difficulty of the problem is essentially 
determined by the samples most nearly orthogonal to the solution vector. Unfortu- 
nately, it provides no help when we face an unsolved problem, since the bound is 
expressed in terms of a solution vector which is unknown. In general, it is clear that 
linearly-separable problems can be made arbitrarily difficult to solve by making the 
samples almost coplanar (Computer exercise 2). Nevertheless, if the training sam- 
ples are linearly separable, the fixed-increment rule will yield a solution after a finite 
number of corrections. 
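
To make the bound of Eq. 26 concrete, the following sketch evaluates $k_0$ for a small set of normalized patterns and compares it with the number of corrections the fixed-increment rule actually makes from a(1) = 0. The patterns, the solution vector `a_hat`, and the function names are illustrative assumptions; in a real problem no solution vector is available in advance, which is precisely the limitation noted above.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def correction_bound(ys, a_hat):
    """Evaluate Eq. 26: k0 = max ||y_i||^2 * ||a_hat||^2 / min (y_i . a_hat)^2."""
    beta_sq = max(dot(y, y) for y in ys)
    gamma = min(dot(a_hat, y) for y in ys)
    if gamma <= 0:
        raise ValueError("a_hat is not a solution vector")
    return Fraction(beta_sq * dot(a_hat, a_hat), gamma ** 2)


def count_corrections(ys):
    """Run the fixed-increment single-sample rule from a = 0 and count
    corrections (assumes the patterns are linearly separable)."""
    n = len(ys)
    a = [0] * len(ys[0])
    corrections = clean = k = 0
    while clean < n:
        if dot(a, ys[k]) <= 0:
            a = [ai + yi for ai, yi in zip(a, ys[k])]
            corrections += 1
            clean = 0
        else:
            clean += 1
        k = (k + 1) % n
    return corrections


# Hypothetical normalized, augmented patterns and one known solution vector.
ys = [(1, 1, 2), (1, 2, 0), (-1, -3, -1), (-1, -2, -3)]
a_hat = (11, -4, -2)

k0 = correction_bound(ys, a_hat)
used = count_corrections(ys)
print("bound k0 =", float(k0), " corrections actually made =", used)
```

Because Eq. 26 is unchanged when `a_hat` is rescaled, any positive multiple of the same solution vector yields the same bound.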


5.5.3 Some Direct Generalizations 


The fixed-increment rule can be generalized to provide a variety of related algorithms.
We shall briefly consider two variants of particular interest. The first variant
introduces a variable increment $\eta(k)$ and a margin b, and calls for a correction whenever
$a^t(k)\,y^k$ fails to exceed the margin. The update is given by

$$\begin{aligned} a(1) &\quad \text{arbitrary} \\ a(k+1) &= a(k) + \eta(k)\,y^k, \quad k \ge 1 \end{aligned} \tag{27}$$

where now $a^t(k)\,y^k \le b$ for all k. Thus for n patterns, our algorithm is:

Algorithm 5 (Variable increment Perceptron with margin)

1 begin initialize a, criterion θ, margin b, η(·), k ← 0
2   do k ← k + 1
3     if a^t y_k ≤ b then a ← a + η(k) y_k
4   until a^t y_k > b for all k
5   return a
6 end
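
A hedged Python sketch of the variable-increment rule with margin follows. It assumes cyclic presentation of the normalized patterns, counts k over corrections only (as in Eq. 27), and uses a caller-supplied schedule `eta`; these choices, the cap `max_corrections`, and the sample data are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def perceptron_with_margin(ys, b, eta, max_corrections=100_000):
    """Correct a whenever a.y fails to exceed the margin b (Eq. 27).

    ys  -- normalized, augmented patterns
    b   -- nonnegative margin
    eta -- function mapping the correction index k = 1, 2, ... to a step size
    """
    n = len(ys)
    a = [0.0] * len(ys[0])
    k = 0          # number of corrections made so far
    i = 0          # position in the cyclic presentation
    clean = 0      # consecutive patterns that already exceed the margin
    while clean < n:
        y = ys[i]
        if dot(a, y) <= b:
            k += 1
            if k > max_corrections:
                raise RuntimeError("correction limit reached")
            step = eta(k)
            a = [ai + step * yi for ai, yi in zip(a, y)]
            clean = 0
        else:
            clean += 1
        i = (i + 1) % n
    return a, k


# Hypothetical normalized patterns; constant increment, unit margin.
ys = [(1, 1, 2), (1, 2, 0), (-1, -3, -1), (-1, -2, -3)]
margin = 1.0
a, k = perceptron_with_margin(ys, margin, eta=lambda k: 1.0)
print("weights:", a, "corrections:", k)
```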


It can be shown that if the samples are linearly separable and if

$$\eta(k) \ge 0, \tag{28}$$

$$\lim_{m\to\infty} \sum_{k=1}^{m} \eta(k) = \infty \tag{29}$$

and

$$\lim_{m\to\infty} \frac{\sum_{k=1}^{m} \eta^2(k)}{\left(\sum_{k=1}^{m} \eta(k)\right)^2} = 0, \tag{30}$$

then a(k) converges to a solution vector a satisfying $a^t y_i > b$ for all i (Problem 18).
In particular, these conditions on $\eta(k)$ are satisfied if $\eta(k)$ is a positive constant, or if
it decreases like 1/k.
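
The behaviour of the two limits can be examined numerically. The sketch below tabulates the partial sum of $\eta(k)$ and the ratio of the sum of squares to the squared sum (the ratio whose limit is required to vanish) for the two schedules just mentioned. The third schedule, $\eta(k) = 1/k^2$, is an assumption added here purely as a contrasting case: its partial sums stay bounded, so it does not meet the divergence requirement. Finite sums can only suggest, not prove, the limiting behaviour.

```python
def condition_ratio(eta, m):
    """Return (S1, S2 / S1**2) where S1 = sum of eta(k) and
    S2 = sum of eta(k)**2 over k = 1..m."""
    s1 = 0.0
    s2 = 0.0
    for k in range(1, m + 1):
        e = eta(k)
        s1 += e
        s2 += e * e
    return s1, s2 / (s1 * s1)


schedules = {
    "constant": lambda k: 0.5,
    "1/k": lambda k: 1.0 / k,
    "1/k^2": lambda k: 1.0 / (k * k),   # contrasting case, not from the text
}

for name, eta in schedules.items():
    for m in (10, 1000, 100000):
        total, ratio = condition_ratio(eta, m)
        print(f"{name:8s} m={m:6d}  sum={total:12.4f}  ratio={ratio:.6f}")
```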

Another variant of interest is our original gradient descent algorithm for $J_p$,

$$\begin{aligned} a(1) &\quad \text{arbitrary} \\ a(k+1) &= a(k) + \eta(k) \sum_{y \in Y_k} y, \end{aligned} \tag{31}$$

where $Y_k$ is the set of training samples misclassified by a(k). It is easy to see that this
algorithm will also yield a solution once one recognizes that if $\hat{a}$ is a solution vector
for $y_1, \ldots, y_n$, then it correctly classifies the correction vector

$$y^k = \sum_{y \in Y_k} y.$$

In greater detail, then, the algorithm is

Algorithm 6 (Batch variable increment Perceptron)

1 begin initialize a, η(·), k ← 0
2   do k ← k + 1
3     Y_k ← {}
4     j ← 0
5     do j ← j + 1
6       if y_j is misclassified then Append y_j to Y_k
7     until j = n
8     a ← a + η(k) Σ_{y ∈ Y_k} y
9   until Y_k = {}
10  return a
11 end
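
The batch procedure translates almost line for line into Python. The sketch below is illustrative only: the name `batch_perceptron`, the cap `max_iters`, the constant schedule, and the sample patterns are assumptions, and a pattern counts as misclassified when $a^t y \le 0$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def batch_perceptron(ys, eta, max_iters=10_000):
    """At each iteration add eta(k) times the sum of all currently
    misclassified patterns; stop when none are misclassified."""
    a = [0] * len(ys[0])
    for k in range(1, max_iters + 1):
        misclassified = [y for y in ys if dot(a, y) <= 0]
        if not misclassified:
            return a, k - 1              # k - 1 updates were needed
        step = eta(k)
        total = [sum(column) for column in zip(*misclassified)]
        a = [ai + step * ti for ai, ti in zip(a, total)]
    raise RuntimeError("iteration limit reached; data may be nonseparable")


# Hypothetical normalized, augmented patterns with a constant increment.
ys = [(1, 1, 2), (1, 2, 0), (-1, -3, -1), (-1, -2, -3)]
a, updates = batch_perceptron(ys, eta=lambda k: 1)
print("weights:", a, "batch updates:", updates)
```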


The benefit of batch gradient descent is that the trajectory of the weight vector is
smoothed, compared to that in corresponding single-sample algorithms (e.g., Algorithm 5),
since at each update the full set of misclassified patterns is used: the
local statistical variations in the misclassified patterns tend to cancel while the
large-scale trend does not. Thus, if the samples are linearly separable, all of the possible
correction vectors form a linearly separable set, and if $\eta(k)$ satisfies Eqs. 28-30, the
sequence of weight vectors produced by the gradient descent algorithm for $J_p(\cdot)$ will
always converge to a solution vector.

It is interesting to note that the conditions on $\eta(k)$ are satisfied if $\eta(k)$ is a positive
constant, if it decreases as 1/k, or even if it increases as k. Generally speaking, one
would prefer to have $\eta(k)$ become smaller as time goes on. This is particularly true
if there is reason to believe that the set of samples is not linearly separable, since it
reduces the disruptive effects of a few “bad” samples. However, in the separable case
it is a curious fact that one can allow $\eta(k)$ to become larger and still obtain a solution.

This observation brings out one of the differences between theoretical and practical
attitudes. From a theoretical viewpoint, it is interesting that we can obtain a solution
in a finite number of steps for any finite set of separable samples, for any initial weight
vector a(1), for any nonnegative margin b, and for any scale factor $\eta(k)$ satisfying
Eqs. 28-30. From a practical viewpoint, we want to make wise choices for these
quantities. Consider the margin b, for example. If b is much smaller than $\eta(k)\|y^k\|^2$,
the amount by which a correction increases $a^t(k)\,y^k$, it is clear that it will have little
effect at all. If it is much larger than $\eta(k)\|y^k\|^2$, many corrections will be needed
to satisfy the conditions $a^t(k)\,y^k > b$. A value close to $\eta(k)\|y^k\|^2$ is often a useful
compromise. In addition to these choices for $\eta(k)$ and b, the scaling of the components
of $y^k$ can also have a great effect on the results. The possession of a convergence
theorem does not remove the need for thought in applying these techniques.

A close descendant of the Perceptron algorithm is the Winnow algorithm, which
has applicability to separable training data. The key difference is that while the
weight vector returned by the Perceptron algorithm has components $a_i$ (i = 0, ..., d),
in Winnow they are scaled according to $2\sinh[a_i]$. In one version, the balanced Winnow
algorithm, there are separate “positive” and “negative” weight vectors, $a^+$ and
$a^-$, each associated with one of the two categories to be learned. Corrections on the
positive weight are made if and only if a training pattern in $\omega_1$ is misclassified;
conversely, corrections on the negative weight are made if and only if a training pattern
in $\omega_2$ is misclassified.


Algorithm 7 (Balanced Winnow)

1 begin initialize a^+, a^-, η(·), k ← 0, α > 1
2   if sign[(a^+)^t y_k − (a^-)^t y_k] ≠ z_k (pattern misclassified)
3     then if z_k = +1 then a_i^+ ← α^{+y_i} a_i^+; a_i^- ← α^{−y_i} a_i^- for all i
4          if z_k = −1 then a_i^+ ← α^{−y_i} a_i^+; a_i^- ← α^{+y_i} a_i^- for all i
5   return a^+, a^-
6 end
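
A single multiplicative correction of this kind can be sketched as follows. The function name `balanced_winnow_step`, the initial weights of 1, the value α = 2, and the example pattern are assumptions chosen for illustration; only one update is shown, not a complete training loop.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def sign(v):
    """Return +1, 0 or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def balanced_winnow_step(a_pos, a_neg, y, z, alpha=2.0):
    """Apply one balanced-Winnow correction for pattern y with label z (+1 or -1).

    If the pattern is already classified correctly the weights are returned
    unchanged; otherwise each component is scaled by alpha**(+/- y_i)."""
    if sign(dot(a_pos, y) - dot(a_neg, y)) == z:
        return a_pos, a_neg
    new_pos = [ai * alpha ** (z * yi) for ai, yi in zip(a_pos, y)]
    new_neg = [ai * alpha ** (-z * yi) for ai, yi in zip(a_neg, y)]
    return new_pos, new_neg


# Hypothetical pattern from the first category, starting from equal weights.
y, z = (1, 2, -1), +1
a_pos, a_neg = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
score_before = dot(a_pos, y) - dot(a_neg, y)
a_pos, a_neg = balanced_winnow_step(a_pos, a_neg, y, z)
score_after = dot(a_pos, y) - dot(a_neg, y)
print("score before:", score_before, "score after:", score_after)
```

As long as all weights are positive, each correction moves the discriminant value for the offending pattern toward the correct sign, which parallels the additive Perceptron correction.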


There are two main benefits of such a version of the Winnow algorithm. The 
first is that during training each of the two constituent weight vectors moves in a uniform
direction and this means the “gap,” determined by these two vectors, can never
increase in size for separable data. This leads to a convergence proof that, while some- 
what more complicated, is nevertheless more general than the Perceptron convergence 
theorem (cf. Bibliography). The second benefit is that convergence is generally faster 
than in a Perceptron, since for proper setting of learning rate, each constituent weight 
does not overshoot its final value. This benefit is especially pronounced whenever a 
large number of irrelevant or redundant features are present (Computer exercise 6). 


5.6 Relaxation Procedures 


5.6.1 The Descent Algorithm 


The criterion function Jp is by no means the only function we can construct that is 
minimized when a is a solution vector. A close but distinct relative is 


$$J_q(a) = \sum_{y \in Y} (a^t y)^2, \tag{32}$$

where Y(a) again denotes the set of training samples misclassified by a. Like $J_p$, $J_q$
focuses attention on the misclassified samples. Its chief difference is that its gradient
is continuous, whereas the gradient of $J_p$ is not. Thus, $J_q$ presents a smoother surface
to search (Fig. 5.11). Unfortunately, $J_q$ is so smooth near the boundary of the solution
region that the sequence of weight vectors can converge to a point on the boundary.
It is particularly embarrassing to spend some time following the gradient merely to
reach the boundary point a = 0. Another problem with $J_q$ is that its value can be
dominated by the longest sample vectors. Both of these problems are avoided by the
criterion function


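$$J_r(a) = \frac{1}{2} \sum_{y \in Y} \frac{(a^t y - b)^2}{\|y\|^2}, \tag{33}$$

where now Y(a) is the set of samples for which $a^t y \le b$. (If Y(a) is empty, we define
$J_r$ to be zero.) Thus, $J_r(a)$ is never negative, and is zero if and only if $a^t y \ge b$ for all
of the training samples. The gradient of $J_r$ is given by

$$\nabla J_r = \sum_{y \in Y} \frac{a^t y - b}{\|y\|^2}\, y,$$

and the update rule

$$\begin{aligned} a(1) &\quad \text{arbitrary} \\ a(k+1) &= a(k) + \eta(k) \sum_{y \in Y} \frac{b - a^t y}{\|y\|^2}\, y. \end{aligned} \tag{34}$$

Thus the relaxation algorithm becomes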

Algorithm 8 (Batch relaxation with margin)

1 begin initialize a, η(·), k ← 0
2   do k ← k + 1
3     Y_k ← {}
4     j ← 0
5     do j ← j + 1
6       if y_j is misclassified then Append y_j to Y_k
7     until j = n
8     a ← a + η(k) Σ_{y ∈ Y_k} ((b − a^t y) / ||y||²) y
9   until Y_k = {}
10  return a
11 end


As before, we find it easier to prove convergence when the samples are considered
one at a time rather than jointly, i.e., single-sample rather than batch. We also limit
our attention to the fixed-increment case, $\eta(k) = \eta$. Thus, we are again led to consider
a sequence $y^1, y^2, \ldots$ formed from those samples that call for the weight vector to be
corrected. The single-sample correction rule analogous to Eq. 34 is

$$\begin{aligned} a(1) &\quad \text{arbitrary} \\ a(k+1) &= a(k) + \eta\, \frac{b - a^t(k)\,y^k}{\|y^k\|^2}\, y^k, \end{aligned} \tag{35}$$

where $a^t(k)\,y^k \le b$ for all k. The algorithm is:

Algorithm 9 (Single-sample relaxation with margin)

1 begin initialize a, η(·), k ← 0
2   do k ← k + 1
3     if y_k is misclassified then a ← a + η(k) ((b − a^t y_k) / ||y_k||²) y_k
4   until all patterns properly classified
5   return a
6 end
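
The single-sample relaxation rule can be sketched in Python as below. The sketch assumes cyclic presentation, a fixed η = 1.5 (overrelaxation), a unit margin, and a cap on the number of corrections; the function names and the sample patterns are likewise assumptions. Because the number of corrections need not be finite, the loop returns the current weight vector when the cap is reached instead of raising an error.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def relaxation_step(a, y, b, eta):
    """One correction of Eq. 35: move a toward the hyperplane a.y = b
    by eta times its distance from that hyperplane."""
    scale = eta * (b - dot(a, y)) / dot(y, y)
    return [ai + scale * yi for ai, yi in zip(a, y)]


def single_sample_relaxation(ys, b, eta=1.5, max_corrections=10_000):
    """Cycle through the patterns, correcting a whenever a.y <= b."""
    n = len(ys)
    a = [0.0] * len(ys[0])
    corrections = clean = i = 0
    while clean < n and corrections < max_corrections:
        y = ys[i]
        if dot(a, y) <= b:
            a = relaxation_step(a, y, b, eta)
            corrections += 1
            clean = 0
        else:
            clean += 1
        i = (i + 1) % n
    return a, corrections


# Hypothetical normalized, augmented patterns.
ys = [[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]]
a, corrections = single_sample_relaxation(ys, b=1.0)
print("weights:", a, "corrections:", corrections)
```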


This algorithm is known as the single-sample relaxation rule with margin, and it
has a simple geometrical interpretation. The quantity

$$r(k) = \frac{b - a^t(k)\,y^k}{\|y^k\|} \tag{36}$$

is the distance from a(k) to the hyperplane $a^t y^k = b$. Since $y^k/\|y^k\|$ is the unit
normal vector for the hyperplane, Eq. 35 calls for a(k) to be moved a certain fraction
$\eta$ of the distance from a(k) to the hyperplane. If $\eta = 1$, a(k) is moved exactly to the
hyperplane, so that the “tension” created by the inequality $a^t(k)\,y^k \le b$ is “relaxed”
(Fig. 5.14). From Eq. 35, after a correction,

$$a^t(k+1)\,y^k - b = (1 - \eta)\,\bigl(a^t(k)\,y^k - b\bigr). \tag{37}$$

If $\eta < 1$, then $a^t(k+1)\,y^k$ is still less than b, while if $\eta > 1$, then $a^t(k+1)\,y^k$ is
greater than b. These conditions are referred to as underrelaxation and overrelaxation,
respectively. In general, we shall restrict $\eta$ to the range $0 < \eta < 2$ (Figs. 5.14 & 5.15).
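
For completeness, the relation between the margin violation before and after a correction follows from Eq. 35 in one step, by taking the inner product of both sides with $y^k$ and using $(y^k)^t y^k = \|y^k\|^2$ (a short worked derivation in the notation used here):

```latex
\begin{aligned}
a^t(k+1)\,y^k
  &= a^t(k)\,y^k + \eta\,\frac{b - a^t(k)\,y^k}{\|y^k\|^2}\,(y^k)^t y^k
   = a^t(k)\,y^k + \eta\,\bigl(b - a^t(k)\,y^k\bigr), \\
a^t(k+1)\,y^k - b
  &= (1 - \eta)\,\bigl(a^t(k)\,y^k - b\bigr).
\end{aligned}
```

Since $a^t(k)\,y^k - b \le 0$ before the correction, the sign of the factor $(1 - \eta)$ decides on which side of the hyperplane the corrected vector lies.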


Figure 5.14: In each step of a basic relaxation algorithm, the weight vector is moved
a proportion $\eta$ of the way towards the hyperplane defined by $a^t y^k = b$.


5.6.2 Convergence Proof 


When the relaxation rule is applied to a set of linearly separable samples, the number
of corrections may or may not be finite. If it is finite, then of course we have obtained
a solution vector. If it is not finite, we shall see that a(k) converges to a limit vector
on the boundary of the solution region. Since the region in which $a^t y \ge b$ is contained
in a larger region where $a^t y > 0$ if b > 0, this implies that a(k) will enter this larger
region at least once, eventually remaining there for all k greater than some finite $k_0$.




Figure 5.15: At the left, underrelaxation ($\eta < 1$) leads to needlessly slow descent, or
even failure to converge. Overrelaxation ($1 < \eta < 2$, shown in the middle) describes
overshooting; nevertheless convergence will ultimately be achieved.


The proof depends upon the fact that if $\hat{a}$ is any vector in the solution region,
i.e., any vector satisfying $\hat{a}^t y_i > b$ for all i, then at each step a(k) gets closer to $\hat{a}$.
This fact follows at once from Eq. 35, since

$$\|a(k+1) - \hat{a}\|^2 = \|a(k) - \hat{a}\|^2 - 2\eta\,\frac{b - a^t(k)\,y^k}{\|y^k\|^2}\,(\hat{a} - a(k))^t y^k + \eta^2\,\frac{\bigl(b - a^t(k)\,y^k\bigr)^2}{\|y^k\|^2} \tag{38}$$

and

$$(\hat{a} - a(k))^t y^k > b - a^t(k)\,y^k \ge 0, \tag{39}$$

so that

$$\|a(k+1) - \hat{a}\|^2 \le \|a(k) - \hat{a}\|^2 - \eta(2 - \eta)\,\frac{\bigl(b - a^t(k)\,y^k\bigr)^2}{\|y^k\|^2}. \tag{40}$$

Since we restrict $\eta$ to the range $0 < \eta < 2$, it follows that $\|a(k+1) - \hat{a}\| \le \|a(k) - \hat{a}\|$.
Thus, the vectors in the sequence a(1), a(2), ... get closer and closer to $\hat{a}$, and in the
limit as k goes to infinity the distance $\|a(k) - \hat{a}\|$ approaches some limiting distance
$r(\hat{a})$. This means that as k goes to infinity a(k) is confined to the surface of a
hypersphere with center $\hat{a}$ and radius $r(\hat{a})$. Since this is true for any $\hat{a}$ in the solution
region, the limiting a(k) is confined to the intersection of the hyperspheres centered
about all of the possible solution vectors.

We now show that the common intersection of these hyperspheres is a single point
on the boundary of the solution region. Suppose first that there are at least two
points $a'$ and $a''$ on the common intersection. Then $\|a' - \hat{a}\| = \|a'' - \hat{a}\|$ for every $\hat{a}$
in the solution region. But this implies that the solution region is contained in the
(d − 1)-dimensional hyperplane of points equidistant from $a'$ and $a''$, whereas we know
that the solution region is d-dimensional. (Stated formally, if $\hat{a}^t y_i > 0$ for i = 1, ..., n,
then for any d-dimensional vector v, we have $(\hat{a} + \epsilon v)^t y_i > 0$ for i = 1, ..., n if $\epsilon$ is
sufficiently small.) Thus, a(k) converges to a single point a. This point is certainly
not inside the solution region, for then the sequence would be finite. It is not outside
either, since each correction causes the weight vector to move $\eta$ times its distance
from the boundary plane, thereby preventing the vector from being bounded away
from the boundary forever. Hence the limit point must be on the boundary.


5.7 Nonseparable Behavior 


The Perceptron and relaxation procedures give us a number of simple methods for 
finding a separating vector when the samples are linearly separable. All of these 
methods are called error-correcting procedures, because they call for a modification 
of the weight vector when and only when an error is encountered. Their success on 
separable problems is largely due to this relentless search for an error-free solution. 
In practice, one would only consider the use of these methods if there was reason to 
believe that the error rate for the optimal linear discriminant function is low. 

Of course, even if a separating vector is found for the training samples, it does 
not follow that the resulting classifier will perform well on independent test data. 
A moment’s reflection will show that any set of fewer than 2d samples is likely to 
be linearly separable — a matter we shall return to in Chap. ??. Thus, one should 
use several times that many design samples to overdetermine the classifier, thereby 
ensuring that the performance on training and test data will be similar. Unfortunately, 
sufficiently large design sets are almost certainly not linearly separable. This makes it 
important to know how the error-correction procedures will behave when the samples 
are nonseparable. 

Since no weight vector can correctly classify every sample in a nonseparable set (by 
definition), it is clear that the corrections in an error-correction procedure can never 
cease. Each algorithm produces an infinite sequence of weight vectors, any member 
of which may or may not yield a useful “solution.” The exact nonseparable behavior 
of these rules has been studied thoroughly in a few special cases. It is known, for
example, that the length of the weight vectors produced by the fixed-increment rule
is bounded. Empirical rules for terminating the correction procedure are often based
on this tendency for the length of the weight vector to fluctuate near some limiting 
value. From a theoretical viewpoint, if the components of the samples are integer- 
valued, the fixed-increment procedure yields a finite-state process. If the correction 
process is terminated at some arbitrary point, the weight vector may or may not be in 
a good state. By averaging the weight vectors produced by the correction rule, one can 
reduce the risk of obtaining a bad solution by accidentally choosing an unfortunate 
termination time. 

A number of similar heuristic modifications to the error-correction rules have been
suggested and studied empirically. The goal of these modifications is to obtain
acceptable performance on nonseparable problems while preserving the ability to find a
separating vector on separable problems. A common suggestion is the use of a variable
increment $\eta(k)$, with $\eta(k)$ approaching zero as k approaches infinity. The rate
at which $\eta(k)$ approaches zero is quite important. If it is too slow, the results will
still be sensitive to those training samples that render the set nonseparable. If it is
too fast, the weight vector may converge prematurely with less than optimal results.
One way to choose $\eta(k)$ is to make it a function of recent performance, decreasing
it as performance improves. Another way is to program $\eta(k)$ by a choice such as
$\eta(k) = \eta(1)/k$. When we examine stochastic approximation techniques, we shall see
that this latter choice is the theoretical solution to an analogous problem. Before we
take up this topic, however, we shall consider an approach that sacrifices the ability
to obtain a separating vector for good compromise performance on both separable
and nonseparable problems.


5.8 Minimum Squared Error Procedures 


5.8.1 Minimum Squared Error and the Pseudoinverse 


The criterion functions we have considered thus far have focussed their attention on
the misclassified samples. We shall now consider a criterion function that involves all
of the samples. Where previously we have sought a weight vector a making all of the
inner products $a^t y_i$ positive, now we shall try to make $a^t y_i = b_i$, where the $b_i$ are
some arbitrarily specified positive constants. Thus, we have replaced the problem of
finding the solution to a set of linear inequalities with the more stringent but better
understood problem of finding the solution to a set of linear equations.

The treatment of simultaneous linear equations is simplified by introducing matrix
notation. Let Y be the n-by-$\hat{d}$ matrix ($\hat{d} = d + 1$) whose ith row is the vector $y_i^t$,
and let b be the column vector $b = (b_1, \ldots, b_n)^t$. Then our problem is to find a weight
vector a satisfying

$$\begin{pmatrix} Y_{10} & Y_{11} & \cdots & Y_{1d} \\ Y_{20} & Y_{21} & \cdots & Y_{2d} \\ \vdots & \vdots & & \vdots \\ Y_{n0} & Y_{n1} & \cdots & Y_{nd} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} \quad \text{or} \quad Ya = b. \tag{41}$$

If Y were nonsingular, we could write $a = Y^{-1}b$ and obtain a formal solution at once.
However, Y is rectangular, usually with more rows than columns. When there are
more equations than unknowns, a is overdetermined, and ordinarily no exact solution
exists. However, we can seek a weight vector a that minimizes some function of the
error between Ya and b. If we define the error vector e by

$$e = Ya - b \tag{42}$$


then one approach is to try to minimize the squared length of the error vector. This
is equivalent to minimizing the sum-of-squared-error criterion function

$$J_s(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} (a^t y_i - b_i)^2. \tag{43}$$

The problem of minimizing the sum of squared error is a classical one. It can be
solved by a gradient search procedure, as we shall see in Sect. ??. A simple closed-form
solution can also be found by forming the gradient

$$\nabla J_s = \sum_{i=1}^{n} 2\,(a^t y_i - b_i)\,y_i = 2Y^t(Ya - b) \tag{44}$$

and setting it equal to zero. This yields the necessary condition

$$Y^tYa = Y^tb, \tag{45}$$

and in this way we have converted the problem of solving Ya = b to that of solving
$Y^tYa = Y^tb$. This celebrated equation has the great advantage that the $\hat{d}$-by-$\hat{d}$
matrix $Y^tY$ is square and often nonsingular. If it is nonsingular, we can solve for a
uniquely as

$$a = (Y^tY)^{-1}Y^tb = Y^\dagger b, \tag{46}$$

where the $\hat{d}$-by-n matrix

$$Y^\dagger \equiv (Y^tY)^{-1}Y^t \tag{47}$$

is called the pseudoinverse of Y. Note that if Y is square and nonsingular, the
pseudoinverse coincides with the regular inverse. Note also that $Y^\dagger Y = I$, but $YY^\dagger \ne I$
in general. However, a minimum-squared-error (MSE) solution always exists. In
particular, if $Y^\dagger$ is defined more generally by

$$Y^\dagger \equiv \lim_{\epsilon \to 0}\, (Y^tY + \epsilon I)^{-1}Y^t, \tag{48}$$

it can be shown that this limit always exists, and that $a = Y^\dagger b$ is an MSE solution
to Ya = b.

The MSE solution depends on the margin vector b, and we shall see that different 
choices for b give the solution different properties. If b is fixed arbitrarily, there is 
no reason to believe that the MSE solution yields a separating vector in the linearly 
separable case. However, it is reasonable to hope that by minimizing the squared- 
error criterion function we might obtain a useful discriminant function in both the 
separable and the nonseparable cases. We shall now examine two properties of the 
solution that support this hope. 


Example 1: Constructing a linear classifier by matrix pseudoinverse


Suppose we have the following two-dimensional points for two categories: $\omega_1$:
$(1,2)^t$ and $(2,0)^t$, and $\omega_2$: $(3,1)^t$ and $(2,3)^t$, as shown in black and red, respectively,
in the figure.

Our matrix Y is therefore

$$Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}$$

and after a few simple calculations we find that its pseudoinverse is

$$Y^\dagger \equiv \lim_{\epsilon \to 0}\, (Y^tY + \epsilon I)^{-1}Y^t = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}.$$

[Figure: the training points and the decision boundary found by means of a pseudoinverse technique.]

We arbitrarily let all the margins be equal, i.e., $b = (1,1,1,1)^t$. Our solution is
$a = Y^\dagger b = (11/3, -4/3, -2/3)^t$, and leads to the decision boundary shown in the
figure. Other choices for b would typically lead to different decision boundaries, of
course.
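
The arithmetic of this example can be checked with exact rational numbers. The sketch below is one way to do so under stated assumptions: the helper functions `matmul`, `transpose`, and `inverse` are written here for illustration (a numerical library would normally be used instead), and the pseudoinverse is formed as $(Y^tY)^{-1}Y^t$, which is valid because $Y^tY$ is nonsingular for these four points.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def inverse(M):
    """Invert a square nonsingular matrix by Gauss-Jordan elimination,
    using exact rational arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)   # pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


# Normalized, augmented patterns of the example (rows of Y) and margins b.
Y = [[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]]
b = [[1], [1], [1], [1]]

Yt = transpose(Y)
Y_dagger = matmul(inverse(matmul(Yt, Y)), Yt)     # (Y^t Y)^{-1} Y^t
a = [row[0] for row in matmul(Y_dagger, b)]

for row in Y_dagger:
    print([str(v) for v in row])
print("a =", [str(v) for v in a])
```

Running the sketch should reproduce the pseudoinverse and the weight vector stated in the example.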


5.8.2 Relation to Fisher’s Linear Discriminant 


In this section we shall show that with the proper choice of the vector b, the MSE
discriminant function $a^t y$ is directly related to Fisher’s linear discriminant. To do
this, we must return to the use of linear rather than generalized linear discriminant
functions. We assume that we have a set of n d-dimensional samples $x_1, \ldots, x_n$, $n_1$ of
which are in the subset $D_1$ labelled $\omega_1$, and $n_2$ of which are in the subset $D_2$ labelled
$\omega_2$. Further, we assume that a sample $y_i$ is formed from $x_i$ by adding a threshold
component $x_0 = 1$ to make an augmented pattern vector. Further, if the sample is
labelled $\omega_2$, then the entire pattern vector is multiplied by −1, the “normalization”
we saw in Sect. 5.4.1. With no loss in generality, we can assume that the first $n_1$
samples are labelled $\omega_1$ and the second $n_2$ are labelled $\omega_2$. Then the matrix Y can
be partitioned as follows:

$$Y = \begin{bmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{bmatrix},$$

where $1_i$ is a column vector of $n_i$ ones, and $X_i$ is an $n_i$-by-d matrix whose rows are
the samples labelled $\omega_i$. We partition a and b correspondingly, with

$$a = \begin{bmatrix} w_0 \\ w \end{bmatrix}$$

and with

$$b = \begin{bmatrix} \frac{n}{n_1}\, 1_1 \\[4pt] \frac{n}{n_2}\, 1_2 \end{bmatrix}.$$

We shall now show that this special choice for b links the MSE solution to Fisher’s
linear discriminant.
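We begin by writing Eq. 47 for a in terms of the partitioned matrices:

$$\begin{bmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{bmatrix} \begin{bmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{bmatrix} \begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{bmatrix} \begin{bmatrix} \frac{n}{n_1}\, 1_1 \\[4pt] \frac{n}{n_2}\, 1_2 \end{bmatrix}. \tag{49}$$

By defining the sample means $m_i$ and the pooled sample scatter matrix $S_W$ as

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x, \qquad i = 1, 2 \tag{50}$$

and

$$S_W = \sum_{i=1}^{2} \sum_{x \in D_i} (x - m_i)(x - m_i)^t, \tag{51}$$

we can multiply the matrices of Eq. 49 and obtain

$$\begin{bmatrix} n & (n_1 m_1 + n_2 m_2)^t \\ (n_1 m_1 + n_2 m_2) & S_W + n_1 m_1 m_1^t + n_2 m_2 m_2^t \end{bmatrix} \begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ n(m_1 - m_2) \end{bmatrix}.$$

This can be viewed as a pair of equations, the first of which can be solved for $w_0$ in
terms of w:

$$w_0 = -m^t w, \tag{52}$$

where m is the mean of all of the samples. Substituting this in the second equation
and performing a few algebraic manipulations, we obtain

$$\left[ \frac{1}{n} S_W + \frac{n_1 n_2}{n^2} (m_1 - m_2)(m_1 - m_2)^t \right] w = m_1 - m_2. \tag{53}$$

Since the vector $(m_1 - m_2)(m_1 - m_2)^t w$ is in the direction of $m_1 - m_2$ for any value
of w, we can write

$$\frac{n_1 n_2}{n^2} (m_1 - m_2)(m_1 - m_2)^t w = (1 - \alpha)(m_1 - m_2),$$

where $\alpha$ is some scalar. Then Eq. 53 yields

$$w = \alpha n S_W^{-1}(m_1 - m_2), \tag{54}$$

which, except for an unimportant scale factor, is identical to the solution for Fisher’s
linear discriminant. In addition, we obtain the threshold weight $w_0$ and the following
decision rule: Decide $\omega_1$ if $w^t(x - m) > 0$; otherwise decide $\omega_2$.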
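
The link between the MSE solution and Fisher’s direction can be checked on a small example with exact arithmetic. In the sketch below the five sample points, the helper functions, and all variable names are assumptions made for illustration; the check is that the MSE weight vector w obtained with $b$ set to $n/n_1$ and $n/n_2$ is parallel to $S_W^{-1}(m_1 - m_2)$ and that $w_0 = -m^t w$.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse(M):
    """Gauss-Jordan inverse of a square nonsingular matrix (exact arithmetic)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def mean(X):
    return [Fraction(sum(col), len(X)) for col in zip(*X)]


def scatter(X, mu):
    """Sum over x in X of (x - mu)(x - mu)^t for two-dimensional samples."""
    S = [[Fraction(0)] * 2 for _ in range(2)]
    for x in X:
        d = [xi - mi for xi, mi in zip(x, mu)]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S


# Hypothetical samples: three in omega_1, two in omega_2.
X1 = [(1, 2), (2, 0), (2, 2)]
X2 = [(3, 1), (2, 3)]
n1, n2 = len(X1), len(X2)
n = n1 + n2

# Normalized augmented data matrix Y and the special margin vector b.
Y = [[1, *x] for x in X1] + [[-1, *(-v for v in x)] for x in X2]
b = [[Fraction(n, n1)]] * n1 + [[Fraction(n, n2)]] * n2

Yt = transpose(Y)
a = [row[0] for row in matmul(matmul(inverse(matmul(Yt, Y)), Yt), b)]
w0, w = a[0], a[1:]

# Fisher direction S_W^{-1} (m1 - m2).
m1, m2, m = mean(X1), mean(X2), mean(X1 + X2)
S1, S2 = scatter(X1, m1), scatter(X2, m2)
Sw = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]
fisher = [row[0] for row in matmul(inverse(Sw), [[m1[i] - m2[i]] for i in range(2)])]

print("w      =", [str(v) for v in w])
print("fisher =", [str(v) for v in fisher])
```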
5.8.3 Asymptotic Approximation to an Optimal Discriminant


Another property of the MSE solution that recommends its use is that if $b = 1_n$, it
approaches a minimum mean-squared-error approximation to the Bayes discriminant
function

$$g_0(x) = P(\omega_1|x) - P(\omega_2|x) \tag{55}$$

in the limit as the number of samples approaches infinity. To demonstrate this fact,
we must assume that the samples are drawn independently, identically distributed
(i.i.d.) according to the probability law

$$p(x) = p(x|\omega_1)P(\omega_1) + p(x|\omega_2)P(\omega_2). \tag{56}$$

In terms of the augmented vector y, the MSE solution yields the series expansion
$g(x) = a^t y$, where y = y(x). If we define the mean-squared approximation error by

$$\epsilon^2 = \int \left[a^t y - g_0(x)\right]^2 p(x)\, dx, \tag{57}$$

then our goal is to show that $\epsilon^2$ is minimized by the solution $a = Y^\dagger 1_n$.

The proof is simplified if we preserve the distinction between category $\omega_1$ and
category $\omega_2$ samples. In terms of the unnormalized data, the criterion function $J_s$
becomes

$$\begin{aligned} J_s(a) &= \sum_{y \in Y_1} (a^t y - 1)^2 + \sum_{y \in Y_2} (a^t y + 1)^2 \\ &= n \left[ \frac{n_1}{n} \frac{1}{n_1} \sum_{y \in Y_1} (a^t y - 1)^2 + \frac{n_2}{n} \frac{1}{n_2} \sum_{y \in Y_2} (a^t y + 1)^2 \right]. \end{aligned} \tag{58}$$

Thus, by the law of large numbers, as n approaches infinity $(1/n)J_s(a)$ approaches

$$\bar{J}(a) = P(\omega_1)\,E_1\!\left[(a^t y - 1)^2\right] + P(\omega_2)\,E_2\!\left[(a^t y + 1)^2\right], \tag{59}$$

with probability one, where

$$E_1\!\left[(a^t y - 1)^2\right] = \int (a^t y - 1)^2\, p(x|\omega_1)\, dx$$

and

$$E_2\!\left[(a^t y + 1)^2\right] = \int (a^t y + 1)^2\, p(x|\omega_2)\, dx.$$

Now, if we recognize from Eq. 55 that

$$g_0(x) = \frac{p(x, \omega_1) - p(x, \omega_2)}{p(x)},$$

we see that

$$\begin{aligned} \bar{J}(a) &= \int (a^t y - 1)^2\, p(x, \omega_1)\, dx + \int (a^t y + 1)^2\, p(x, \omega_2)\, dx \\ &= \int (a^t y)^2\, p(x)\, dx - 2 \int a^t y\, g_0(x)\, p(x)\, dx + 1 \\ &= \underbrace{\int \left[a^t y - g_0(x)\right]^2 p(x)\, dx}_{\epsilon^2} + \underbrace{\left[ 1 - \int g_0^2(x)\, p(x)\, dx \right]}_{\text{indep. of } a}. \end{aligned} \tag{60}$$

The second term in this sum is independent of the weight vector a. Hence, the a
that minimizes $J_s$ also minimizes $\epsilon^2$, the mean-squared-error between $a^t y$ and $g_0(x)$
(Fig. 5.16). In Chap. ?? we shall see that analogous properties also hold for many
multilayer networks.

Figure 5.16: The top figure shows two class-conditional densities, and the middle figure
the posteriors, assuming equal priors. Minimizing the MSE error also minimizes the
mean-squared-error between $a^t y$ and the discriminant function g(x) (here a 7th-order
polynomial) measured over the data distribution, as shown at the bottom. Note that
the resulting g(x) best approximates $g_0(x)$ in the regions where the data points lie.


This result gives considerable insight into the MSE procedure. By approximating
$g_0(x)$, the discriminant function $a^t y$ gives direct information about the posterior
probabilities $P(\omega_1|x) = (1 + g_0)/2$ and $P(\omega_2|x) = (1 - g_0)/2$. The quality of the
approximation depends on the functions $y_i(x)$ and the number of terms in the expansion
$a^t y$. Unfortunately, the mean-square-error criterion places emphasis on points
where p(x) is larger, rather than on points near the decision surface $g_0(x) = 0$. Thus,
the discriminant function that “best” approximates the Bayes discriminant does not
necessarily minimize the probability of error. Despite this property, the MSE solution
has interesting properties, and has received considerable attention in the literature.
We shall encounter the mean-square approximation of $g_0(x)$ again when we consider
stochastic approximation methods and multilayer neural networks.


5.8.4 The Widrow-Hoff Procedure 


We remarked earlier that $J_s(a) = \|Ya - b\|^2$ could be minimized by a gradient
descent procedure. Such an approach has two advantages over merely computing the
pseudoinverse: (1) it avoids the problems that arise when $Y^t Y$ is singular, and (2)
it avoids the need for working with large matrices. In addition, the computation
involved is effectively a feedback scheme which automatically copes with some of the
computational problems due to roundoff or truncation. Since $\nabla J_s = 2Y^t(Ya - b)$,
the obvious update rule, which steps against the gradient, is

$$
\begin{aligned}
a(1) &\quad \text{arbitrary} \\
a(k+1) &= a(k) + \eta(k)\, Y^t (b - Y a(k)).
\end{aligned}
$$


In Problem 24 you are asked to show that if $\eta(k) = \eta(1)/k$, where $\eta(1)$ is any positive
constant, then this rule generates a sequence of weight vectors that converges to a
limiting vector $a$ satisfying

$$Y^t (Ya - b) = 0.$$

Thus, the descent algorithm always yields a solution regardless of whether or not $Y^t Y$
is singular.

While the $d$-by-$d$ matrix $Y^t Y$ is usually smaller than the $d$-by-$n$ matrix $Y^\dagger$, the
storage requirements can be reduced still further by considering the samples sequentially
and using the Widrow-Hoff or LMS (least-mean-squared) rule:

$$
\begin{aligned}
a(1) &\quad \text{arbitrary} \\
a(k+1) &= a(k) + \eta(k)\, (b_k - a^t(k) y^k)\, y^k,
\end{aligned}
\tag{61}
$$

or in algorithm form:

Algorithm 10 (LMS)

begin initialize a, b, criterion θ, η(·), k = 0
    do k ← k + 1
        a ← a + η(k)(b_k − a^t y^k) y^k
    until ‖η(k)(b_k − a^t y^k) y^k‖ < θ
    return a
end
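
As a rough illustration of Algorithm 10, the following Python sketch cycles through the samples and applies the LMS update with $\eta(k) = \eta(1)/k$. The function name `lms`, the cyclic presentation order, the zero starting vector and the cap `max_steps` are assumptions made for this example and are not part of the algorithm as stated. With the single constant feature $y = 1$ and $\eta(1) = 1$ the rule reduces to a running mean of the targets, which makes the result easy to check by hand.

```python
def lms(samples, targets, eta1=1.0, theta=1e-3, max_steps=10000):
    """Widrow-Hoff (LMS) rule with eta(k) = eta1 / k; samples are presented cyclically."""
    n = len(samples)
    a = [0.0] * len(samples[0])      # a(1) is arbitrary; zero is an assumed choice
    for k in range(1, max_steps + 1):
        y = samples[(k - 1) % n]
        b_k = targets[(k - 1) % n]
        residual = b_k - sum(ai * yi for ai, yi in zip(a, y))
        step = [(eta1 / k) * residual * yi for yi in y]
        a = [ai + si for ai, si in zip(a, step)]
        # Stop once the length of the correction falls below the threshold theta.
        if sum(s * s for s in step) ** 0.5 < theta:
            break
    return a


# One constant feature, targets alternating between 1 and 3: the MSE solution is a = 2.
a = lms([[1.0], [1.0]], [1.0, 3.0])
print(a)
```

Because the step size shrinks like $1/k$, the stopping test is only met after roughly a thousand updates here, which hints at the slow convergence of a programmed step-size schedule.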


At first glance this descent algorithm appears to be essentially the same as the
relaxation rule. The primary difference is that the relaxation rule is an error-correction
rule, whereas the LMS rule makes a correction whenever $a^t(k) y^k$ does not equal $b_k$.
Since it is rarely possible to satisfy all of the equations $a^t y^k = b_k$ at once, the
corrections never cease. Therefore, $\eta(k)$ must decrease with $k$ to obtain convergence,
the choice $\eta(k) = \eta(1)/k$ being common. Exact analysis of the behavior of the
Widrow-Hoff rule in the deterministic case is rather complicated, and merely indicates
that the sequence of weight vectors tends to converge to the desired solution. Instead
of pursuing this topic further, we shall turn to a very similar rule that arises from a
stochastic descent procedure. We note, however, that the solution need not give a
separating vector, even if one exists, as shown in Fig. 5.17 (Computer exercise 10).


Figure 5.17: The LMS algorithm need not converge to a separating hyperplane, even 
if one exists. Since the LMS solution minimizes the sum of the squares of the distances 
of the training points to the hyperplane, for this example the plane is rotated clockwise
compared to a separating hyperplane. 


5.8.5 Stochastic Approximation Methods 


All of the iterative descent procedures we have considered thus far have been described 
in deterministic terms. We are given a particular set of samples, and we generate a 
particular sequence of weight vectors. In this section we digress briefly to consider 
an MSE procedure in which the samples are drawn randomly, resulting in a random 
sequence of weight vectors. We will return in Chap. ?? to the theory of stochastic
approximation; here some of the main ideas are presented without proof.

Suppose that samples are drawn independently by selecting a state of nature with
probability $P(\omega_i)$ and then selecting an $x$ according to the probability law $p(x|\omega_i)$.
For each $x$ we let $\theta$ be its label, with $\theta = +1$ if $x$ is labelled $\omega_1$ and $\theta = -1$ if $x$
is labelled $\omega_2$. Then the data consist of an infinite sequence of independent pairs
$(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_k, \theta_k), \ldots$. Even though the label variable $\theta$ is binary-valued it
can be thought of as a noisy version of the Bayes discriminant function $g_0(x)$. This
follows from the observation that

$$P(\theta = 1 | x) = P(\omega_1 | x)$$

and

$$P(\theta = -1 | x) = P(\omega_2 | x),$$

so that the conditional mean of $\theta$ is given by

$$\mathcal{E}_{\theta|x}[\theta] = \sum_{\theta} \theta P(\theta | x) = P(\omega_1|x) - P(\omega_2|x) = g_0(x). \tag{62}$$

Suppose that we wish to approximate $g_0(x)$ by the finite series expansion

$$g(x) = a^t y = \sum_{i=1}^{d} a_i y_i(x),$$

where both the basis functions $y_i(x)$ and the number of terms $d$ are known. Then we
can seek a weight vector $\hat{a}$ that minimizes the mean-squared approximation error

$$\epsilon^2 = \mathcal{E}[(a^t y - g_0(x))^2]. \tag{63}$$


Minimization of $\epsilon^2$ would appear to require knowledge of the Bayes discriminant $g_0(x)$.
However, as one might have guessed from the analogous situation in Sect. 5.8.3, it
can be shown that the weight vector $\hat{a}$ that minimizes $\epsilon^2$ also minimizes the criterion
function

$$J_m(a) = \mathcal{E}[(a^t y - \theta)^2]. \tag{64}$$

This should also be plausible from the fact that $\theta$ is essentially a noisy version of $g_0(x)$
(Fig. ??). Since the gradient is

$$\nabla J_m = 2\mathcal{E}[(a^t y - \theta) y], \tag{65}$$

we can obtain the closed-form solution

$$\hat{a} = \mathcal{E}[y y^t]^{-1} \mathcal{E}[\theta y]. \tag{66}$$


Thus, one way to use the samples is to estimate $\mathcal{E}[y y^t]$ and $\mathcal{E}[\theta y]$, and use Eq. 66 to
obtain the MSE optimal linear discriminant. An alternative is to minimize $J_m(a)$ by
a gradient descent procedure. Suppose that in place of the true gradient we substitute
the noisy version $2(a^t y_k - \theta_k) y_k$. This leads to the update rule

$$a(k+1) = a(k) + \eta(k)\, (\theta_k - a^t(k) y_k)\, y_k, \tag{67}$$

which is basically just the Widrow-Hoff rule. It can be shown (Problem ??) that if
$\mathcal{E}[y y^t]$ is nonsingular and if the coefficients $\eta(k)$ satisfy

$$\lim_{m \to \infty} \sum_{k=1}^{m} \eta(k) = +\infty \tag{68}$$

and

$$\lim_{m \to \infty} \sum_{k=1}^{m} \eta^2(k) < \infty \tag{69}$$

then $a(k)$ converges to $\hat{a}$ in mean square:

$$\lim_{k \to \infty} \mathcal{E}[\|a(k) - \hat{a}\|^2] = 0. \tag{70}$$

The reasons we need these conditions on $\eta(k)$ are simple. The first condition keeps
the weight vector from converging so fast that a systematic error will remain forever
uncorrected. The second condition ensures that random fluctuations are eventually
suppressed. Both conditions are satisfied by the conventional choice $\eta(k) = 1/k$.
Unfortunately, this kind of programmed decrease of $\eta(k)$, independent of the problem
at hand, often leads to very slow convergence.
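
To see concretely why the choice $\eta(k) = 1/k$ meets both step-size requirements, the two elementary bounds below can be checked directly. They are standard facts about the harmonic series and are offered only as a worked check, not as part of the convergence proof referred to above.

$$
\begin{aligned}
\sum_{k=1}^{m} \frac{1}{k} &\ge \int_{1}^{m+1} \frac{dt}{t} = \ln(m+1) \longrightarrow +\infty, \\
\sum_{k=1}^{m} \frac{1}{k^2} &\le 1 + \sum_{k=2}^{m} \frac{1}{k(k-1)} = 2 - \frac{1}{m} < 2.
\end{aligned}
$$

The first sum grows without bound, so the total correction available is unlimited; the second stays bounded, so the accumulated effect of the noise remains finite.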

Of course, this is neither the only nor the best descent algorithm for minimizing
$J_m$. For example, if we note that the matrix of second partial derivatives for $J_m$ is
given by

$$D = 2\mathcal{E}[y y^t],$$

we see that Newton's rule for minimizing $J_m$ (Eq. 15) is

$$a(k+1) = a(k) + \mathcal{E}[y y^t]^{-1} \mathcal{E}[(\theta - a^t y) y].$$

A stochastic analog of this rule is

$$a(k+1) = a(k) + R_{k+1} (\theta_k - a^t(k) y_k)\, y_k, \tag{71}$$

with

$$R_{k+1}^{-1} = R_k^{-1} + y_k y_k^t \tag{72}$$

or, equivalently,*

$$R_{k+1} = R_k - \frac{R_k y_k (R_k y_k)^t}{1 + y_k^t R_k y_k}. \tag{73}$$

This rule also produces a sequence of weight vectors that converges to the optimal
solution in mean square. Its convergence is faster, but it requires more computation
per step (Computer exercise 8).
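
The equivalence of the inverse update (Eq. 72) and the recursive update (Eq. 73) can be checked numerically. The sketch below, in plain Python, applies both to a small symmetric positive-definite matrix. The helper names `inv2` and `rank_one_update` and the numbers are made up for illustration, and the recursion assumes that $R_k$ is symmetric.

```python
def inv2(m):
    """Inverse of a 2-by-2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def rank_one_update(R, y):
    """Eq. 73: R_next = R - (R y)(R y)^t / (1 + y^t R y), for symmetric R."""
    d = len(y)
    Ry = [sum(R[i][j] * y[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(y[i] * Ry[i] for i in range(d))
    return [[R[i][j] - Ry[i] * Ry[j] / denom for j in range(d)] for i in range(d)]


R = [[2.0, 0.5], [0.5, 1.0]]     # illustrative symmetric positive-definite R_k
y = [1.0, -2.0]                  # illustrative sample y_k

# Eq. 72: add y y^t to the inverse of R, then invert again.
R_inv = inv2(R)
R_next_72 = inv2([[R_inv[i][j] + y[i] * y[j] for j in range(2)] for i in range(2)])

# Eq. 73: update R directly, with no matrix inversion.
R_next_73 = rank_one_update(R, y)

max_diff = max(abs(R_next_72[i][j] - R_next_73[i][j]) for i in range(2) for j in range(2))
print(max_diff)
```

The direct form costs on the order of $d^2$ operations per step instead of a full inversion, which is why it is the form one would normally program.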

These gradient procedures can be viewed as methods for minimizing a criterion
function, or finding the zero of its gradient, in the presence of noise. In the statistical
literature, functions such as $J_m$ and $\nabla J_m$ that have the form $\mathcal{E}[f(a, x)]$ are called
regression functions, and the iterative algorithms are called stochastic approximation
procedures. Two well known ones are the Kiefer-Wolfowitz procedure for minimizing a
regression function, and the Robbins-Monro procedure for finding a root of a regression
function. Often the easiest way to obtain a convergence proof for a particular descent
or approximation procedure is to show that it satisfies the convergence conditions for
these more general procedures. Unfortunately, an exposition of these methods in their
full generality would lead us rather far afield, and we must close this digression by
referring the interested reader to the literature.

* This recursive formula for computing $R_k$, which is roughly $(1/k)\mathcal{E}[y y^t]^{-1}$, cannot be used if $R_k$
is singular. The equivalence of Eq. 72 and Eq. 73 follows from Problem ?? of Chap. ??.


5.9 The Ho-Kashyap Procedures


5.9.1 The Descent Procedure


The procedures we have considered thus far differ in several ways. The Perceptron
and relaxation procedures find separating vectors if the samples are linearly separable,
but do not converge on nonseparable problems. The MSE procedures yield a weight
vector whether the samples are linearly separable or not, but there is no guarantee
that this vector is a separating vector in the separable case (Fig. 5.17). If the margin
vector $b$ is chosen arbitrarily, all we can say is that the MSE procedures minimize
$\|Ya - b\|^2$. Now if the training samples happen to be linearly separable, then there
exists an $\hat{a}$ and a $\hat{b}$ such that


$$Y\hat{a} = \hat{b} > 0,$$

where by $\hat{b} > 0$, we mean that every component of $\hat{b}$ is positive. Clearly, were we
to take $b = \hat{b}$ and apply the MSE procedure, we would obtain a separating vector.
Of course, we usually do not know $\hat{b}$ beforehand. However, we shall now see how the
MSE procedure can be modified to obtain both a separating vector $a$ and a margin
vector $b$. The underlying idea comes from the observation that if the samples are
separable, and if both $a$ and $b$ in the criterion function

$$J_s(a, b) = \|Ya - b\|^2 \tag{74}$$

are allowed to vary (subject to the constraint $b > 0$), then the minimum value of $J_s$
is zero, and the $a$ that achieves that minimum is a separating vector.

To minimize $J_s$, we shall use a modified gradient descent procedure. The gradient
of $J_s$ with respect to $a$ is given by

$$\nabla_a J_s = 2Y^t (Ya - b), \tag{75}$$

and the gradient of $J_s$ with respect to $b$ is given by

$$\nabla_b J_s = -2(Ya - b). \tag{76}$$

For any value of $b$, we can always take

$$a = Y^\dagger b, \tag{77}$$

thereby obtaining $\nabla_a J_s = 0$ and minimizing $J_s$ with respect to $a$ in one step. We
are not so free to modify $b$, however, since we must respect the constraint $b > 0$,
and we must avoid a descent procedure that converges to $b = 0$. One way to prevent
$b$ from converging to zero is to start with $b > 0$ and to refuse to reduce any of its
components. We can do this and still try to follow the negative gradient if we first set
all positive components of $\nabla_b J_s$ to zero. Thus, if we let $|v|$ denote the vector whose
components are the magnitudes of the corresponding components of $v$, we are led to
consider an update rule for the margin of the form

$$b(k+1) = b(k) - \eta\, \frac{1}{2} \left[ \nabla_b J_s - |\nabla_b J_s| \right]. \tag{78}$$


Using Eqs. 76 & 77, and being a bit more specific, we obtain the Ho-Kashyap rule for
minimizing $J_s(a, b)$:

$$
\begin{aligned}
b(1) &> 0 \quad \text{but otherwise arbitrary} \\
b(k+1) &= b(k) + 2\eta(k)\, e^+(k),
\end{aligned}
\tag{79}
$$

where $e(k)$ is the error vector

$$e(k) = Ya(k) - b(k), \tag{80}$$

$e^+(k)$ is the positive part of the error vector

$$e^+(k) = \frac{1}{2} \left( e(k) + |e(k)| \right), \tag{81}$$

and

$$a(k) = Y^\dagger b(k), \qquad k = 1, 2, \ldots \tag{82}$$

Thus if we let $b_{min}$ be a small convergence criterion and Abs[e] denote the vector of
componentwise magnitudes of $e$, our algorithm is:


Algorithm 11 (Ho-Kashyap)

begin initialize a, b, η(·) < 1, criteria b_min, k_max, k = 0
    do k ← k + 1
        e ← Ya − b
        e⁺ ← 1/2(e + Abs[e])
        b ← b + 2η(k)e⁺
        a ← Y†b
        if Abs[e] ≤ b_min then return a, b and exit
    until k = k_max
    print NO SOLUTION FOUND
end


Since the weight vector $a(k)$ is completely determined by the margin vector $b(k)$,
this is basically an algorithm for producing a sequence of margin vectors. The initial
vector $b(1)$ is positive to begin with, and if $\eta > 0$, all subsequent vectors $b(k)$ are
positive. We might worry that if none of the components of $e(k)$ is positive, so that
$b(k)$ stops changing, we might fail to find a solution. However, we shall see that in
that case either $e(k) = 0$ and we have a solution, or $e(k)$ is nonzero with no positive
components and we have proof that the samples are not linearly separable.
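
The Ho-Kashyap iteration can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `solve` and `pinv_apply`, the starting margin $b(1) = (1, \ldots, 1)$, the constant $\eta = 0.5$ and the toy data are all made up for the example, and $Y^\dagger b$ is computed from the normal equations, which presumes that $Y^t Y$ is nonsingular. Rows of `Y` are augmented samples, with the samples of the second class multiplied by $-1$.

```python
def solve(A, rhs):
    """Solve the square system A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def pinv_apply(Y, b):
    """Return (Y^t Y)^{-1} Y^t b, i.e. the pseudoinverse applied to b when Y^t Y is nonsingular."""
    d = len(Y[0])
    YtY = [[sum(row[i] * row[j] for row in Y) for j in range(d)] for i in range(d)]
    Ytb = [sum(row[i] * bi for row, bi in zip(Y, b)) for i in range(d)]
    return solve(YtY, Ytb)


def ho_kashyap(Y, eta=0.5, b_min=1e-3, k_max=1000):
    """Return (a, b) once every |e_i| <= b_min, or None if no solution is found."""
    b = [1.0] * len(Y)                       # b(1) > 0, otherwise arbitrary
    a = pinv_apply(Y, b)
    for _ in range(k_max):
        e = [sum(yi * ai for yi, ai in zip(row, a)) - bi for row, bi in zip(Y, b)]
        if all(abs(ei) <= b_min for ei in e):
            return a, b
        e_plus = [0.5 * (ei + abs(ei)) for ei in e]
        if all(ep == 0.0 for ep in e_plus):  # nonzero error with no positive component
            return None
        b = [bi + 2.0 * eta * ep for bi, ep in zip(b, e_plus)]
        a = pinv_apply(Y, b)
    return None


# Separable toy set: class 1 at (2,2) and (3,1); class 2 at (0,0) and (1,0), sign-flipped.
Y_sep = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [-1.0, 0.0, 0.0], [-1.0, -1.0, 0.0]]
result = ho_kashyap(Y_sep)

# Nonseparable toy set in one dimension: class 1 at x = 0 and x = 2, class 2 at x = 1.
Y_non = [[1.0, 0.0], [1.0, 2.0], [-1.0, -1.0]]
result_non = ho_kashyap(Y_non)
```

On the separable set the loop returns a pair $(a, b)$ with $Ya > 0$; on the nonseparable set the very first error vector has no positive component, so the sketch returns `None` at once.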


5.9.2 Convergence Proof 


We shall now show that if the samples are linearly separable, and if $0 < \eta < 1$, then
the Ho-Kashyap algorithm will yield a solution vector in a finite number of steps. To
make the algorithm terminate, we should add a terminating condition stating that
corrections cease once a solution vector is obtained or some large criterion number of
iterations have occurred. However, it is mathematically more convenient to let the
corrections continue and show that the error vector $e(k)$ either becomes zero for some
finite $k$, or converges to zero as $k$ goes to infinity.

It is clear that either $e(k) = 0$ for some $k$, say $k_0$, or there are no zero vectors
in the sequence $e(1), e(2), \ldots$ In the first case, once a zero vector is obtained, no further
changes occur to $a(k)$, $b(k)$, or $e(k)$, and $Ya(k) = b(k) > 0$ for all $k \ge k_0$. Thus, if
we happen to obtain a zero error vector, the algorithm automatically terminates with
a solution vector.

Suppose now that $e(k)$ is never zero for finite $k$. To see that $e(k)$ must nevertheless
converge to zero, we begin by asking whether or not we might possibly obtain an $e(k)$
with no positive components. This would be most unfortunate, since we would have
$Ya(k) \le b(k)$, and since $e^+(k)$ would be zero, we would obtain no further changes
in $a(k)$, $b(k)$, or $e(k)$. Fortunately, this can never happen if the samples are linearly
separable. A proof is simple, and is based on the fact that if $Y^t Y a(k) = Y^t b(k)$, then
$Y^t e(k) = 0$. But if the samples are linearly separable, there exists an $\hat{a}$ and a $\hat{b} > 0$
such that

$$Y\hat{a} = \hat{b}.$$

Thus,

$$e^t(k) Y \hat{a} = 0 = e^t(k) \hat{b},$$

and since all the components of $\hat{b}$ are positive, either $e(k) = 0$ or at least one of the
components of $e(k)$ must be positive. Since we have excluded the case $e(k) = 0$, it
follows that $e^+(k)$ cannot be zero for finite $k$.

The proof that the error vector always converges to zero exploits the fact that the
matrix $YY^\dagger$ is symmetric, positive semidefinite, and satisfies

$$(YY^\dagger)^t (YY^\dagger) = YY^\dagger. \tag{83}$$

Although these results are true in general, for simplicity we demonstrate them only for
the case where $Y^t Y$ is nonsingular. In this case $YY^\dagger = Y(Y^t Y)^{-1} Y^t$, and the
symmetry is evident. Since $Y^t Y$ is positive definite, so is $(Y^t Y)^{-1}$; thus,
$b^t Y (Y^t Y)^{-1} Y^t b \ge 0$ for any $b$, and $YY^\dagger$ is at least positive semidefinite. Finally,
Eq. 83 follows from

$$(YY^\dagger)^t (YY^\dagger) = [Y (Y^t Y)^{-1} Y^t][Y (Y^t Y)^{-1} Y^t] = Y (Y^t Y)^{-1} Y^t = YY^\dagger.$$

To see that $e(k)$ must converge to zero, we eliminate $a(k)$ between Eqs. 80 & 82
and obtain

$$e(k) = (YY^\dagger - I)\, b(k).$$

Then, using a constant learning rate and Eq. 79 we obtain the recursion relation

$$
\begin{aligned}
e(k+1) &= (YY^\dagger - I)\,(b(k) + 2\eta e^+(k)) \\
&= e(k) + 2\eta (YY^\dagger - I)\, e^+(k),
\end{aligned}
\tag{84}
$$

so that

$$\frac{1}{4} \|e(k+1)\|^2 = \frac{1}{4} \|e(k)\|^2 + \eta\, e^t(k) (YY^\dagger - I)\, e^+(k) + \|\eta (YY^\dagger - I)\, e^+(k)\|^2.$$

Both the second and the third terms simplify considerably. Since $e^t(k) Y = 0$, the
second term becomes

$$\eta\, e^t(k) (YY^\dagger - I)\, e^+(k) = -\eta\, e^t(k) e^+(k) = -\eta \|e^+(k)\|^2,$$

the nonzero components of $e^+(k)$ being the positive components of $e(k)$. Since $YY^\dagger$
is symmetric and is equal to $(YY^\dagger)^t (YY^\dagger)$, the third term simplifies to

$$
\begin{aligned}
\|\eta (YY^\dagger - I)\, e^+(k)\|^2 &= \eta^2 e^{+t}(k) (YY^\dagger - I)^t (YY^\dagger - I)\, e^+(k) \\
&= \eta^2 \|e^+(k)\|^2 - \eta^2 e^{+t}(k)\, YY^\dagger e^+(k),
\end{aligned}
$$

and thus we have

$$\frac{1}{4} \left( \|e(k)\|^2 - \|e(k+1)\|^2 \right) = \eta (1 - \eta) \|e^+(k)\|^2 + \eta^2 e^{+t}(k)\, YY^\dagger e^+(k). \tag{85}$$


Since $e^+(k)$ is nonzero by assumption, and since $YY^\dagger$ is positive semidefinite,
$\|e(k)\|^2 > \|e(k+1)\|^2$ if $0 < \eta < 1$. Thus the sequence $\|e(1)\|^2, \|e(2)\|^2, \ldots$ is
monotonically decreasing and must converge to some limiting value $\|e\|^2$. But for
convergence to take place, $e^+(k)$ must converge to zero, so that all the positive
components of $e(k)$ must converge to zero. Since $e^t(k) \hat{b} = 0$ for all $k$, it follows that all of
the components of $e(k)$ must converge to zero. Thus, if $0 < \eta < 1$ and if the samples
are linearly separable, $a(k)$ will converge to a solution vector as $k$ goes to infinity.

If we test the signs of the components of $Ya(k)$ at each step and terminate the
algorithm when they are all positive, we will in fact obtain a separating vector in a
finite number of steps. This follows from the fact that $Ya(k) = b(k) + e(k)$, and that
the components of $b(k)$ never decrease. Thus, if $b_{min}$ is the smallest component of $b(1)$
and if $e(k)$ converges to zero, then $e(k)$ must enter the hypersphere $\|e(k)\| = b_{min}$ after
a finite number of steps, at which point $Ya(k) > 0$. Although we ignored terminating
conditions to simplify the proof, such a terminating condition would always be used
in practice.


5.9.3 Nonseparable Behavior 


If the convergence proof just given is examined to see how the assumption of
separability was employed, it will be seen that it was needed twice. First, the fact that
$e^t(k) \hat{b} = 0$ was used to show that either $e(k) = 0$ for some finite $k$, or $e^+(k)$ is never
zero and corrections go on forever. Second, this same constraint was used to show
that if $e^+(k)$ converges to zero, $e(k)$ must also converge to zero.

If the samples are not linearly separable, it no longer follows that if $e^+(k)$ is zero
then e(k) must be zero. Indeed, on a nonseparable problem one may well obtain a 
nonzero error vector having no positive components. If this occurs, the algorithm 
automatically terminates and we have proof that the samples are not separable. 

What happens if the patterns are not separable, but $e^+(k)$ is never zero? In this
case it still follows that

$$e(k+1) = e(k) + 2\eta (YY^\dagger - I)\, e^+(k) \tag{86}$$

and

$$\frac{1}{4} \left( \|e(k)\|^2 - \|e(k+1)\|^2 \right) = \eta (1 - \eta) \|e^+(k)\|^2 + \eta^2 e^{+t}(k)\, YY^\dagger e^+(k). \tag{87}$$

Thus, the sequence $\|e(1)\|^2, \|e(2)\|^2, \ldots$ must still converge, though the limiting value
$\|e\|^2$ cannot be zero. Since convergence requires that $e^+(k)$ converge to zero, we can
conclude that either $e^+(k) = 0$ for some finite $k$, or $e^+(k)$ converges to zero while
$\|e(k)\|$ is bounded away from zero. Thus, the Ho-Kashyap algorithm provides us with
a separating vector in the separable case, and with evidence of nonseparability in the
nonseparable case. However, there is no bound on the number of steps needed to
disclose nonseparability.


5.9.4 Some Related Procedures 


If we write $Y^\dagger = (Y^t Y)^{-1} Y^t$ and make use of the fact that $Y^t e(k) = 0$, we can
modify the Ho-Kashyap rule as follows

$$
\begin{aligned}
b(1) &> 0 \quad \text{but otherwise arbitrary} \\
a(1) &= Y^\dagger b(1) \\
b(k+1) &= b(k) + \eta\, (e(k) + |e(k)|) \\
a(k+1) &= a(k) + \eta\, Y^\dagger |e(k)|,
\end{aligned}
\tag{88}
$$

where, as usual,

$$e(k) = Ya(k) - b(k). \tag{89}$$


This then gives the algorithm for fixed learning rate:


Algorithm 12 (Modified Ho-Kashyap)

begin initialize a, b, η < 1, criterion b_min, k_max, k = 0
    do k ← k + 1
        e ← Ya − b
        e⁺ ← 1/2(e + Abs[e])
        b ← b + η(e + Abs[e])
        a ← Y†b
        if Abs[e] ≤ b_min then return a, b and exit
    until k = k_max
    print NO SOLUTION FOUND
end


This algorithm differs from the Perceptron and relaxation algorithms for solving
linear inequalities in at least three ways: (1) it varies both the weight vector $a$ and
the margin vector $b$, (2) it provides evidence of nonseparability, but (3) it requires
the computation of the pseudoinverse of $Y$. Even though this last computation need
be done only once, it can be time consuming, and it requires special treatment if $Y^t Y$
is singular. An interesting alternative algorithm that resembles Eq. 88 but avoids the
need for computing $Y^\dagger$ is

$$
\begin{aligned}
b(1) &> 0 \quad \text{but otherwise arbitrary} \\
a(1) &\quad \text{arbitrary} \\
b(k+1) &= b(k) + (e(k) + |e(k)|) \\
a(k+1) &= a(k) + \eta\, R\, Y^t |e(k)|,
\end{aligned}
\tag{90}
$$

where $R$ is an arbitrary, constant, positive-definite $d$-by-$d$ matrix. We shall show that
if $\eta$ is properly chosen, this algorithm also yields a solution vector in a finite number
of steps, provided that a solution exists. Furthermore, if no solution exists, the vector
$Y^t |e(k)|$ either vanishes, exposing the nonseparability, or converges to zero.

The proof is fairly straightforward. Whether the samples are linearly separable or
not, Eqs. 89 & 90 show that

$$
\begin{aligned}
e(k+1) &= Ya(k+1) - b(k+1) \\
&= (\eta\, Y R Y^t - I)\, |e(k)|.
\end{aligned}
$$

We can find, then, that the squared magnitude is

$$\|e(k+1)\|^2 = |e(k)|^t \left( \eta^2 Y R Y^t Y R Y^t - 2\eta\, Y R Y^t + I \right) |e(k)|,$$

and furthermore

$$\|e(k)\|^2 - \|e(k+1)\|^2 = (Y^t |e(k)|)^t A\, (Y^t |e(k)|), \tag{91}$$

where

$$A = 2\eta R - \eta^2 R Y^t Y R. \tag{92}$$

Clearly, if $\eta$ is positive but sufficiently small, $A$ will be approximately $2\eta R$ and hence
positive definite. Thus, if $Y^t |e(k)| \ne 0$ we will have $\|e(k)\|^2 > \|e(k+1)\|^2$.

At this point we must distinguish between the separable and the nonseparable
case. In the separable case there exists an $\hat{a}$ and a $\hat{b} > 0$ satisfying $Y\hat{a} = \hat{b}$. Thus, if
$|e(k)| \ne 0$,

$$|e(k)|^t Y \hat{a} = |e(k)|^t \hat{b} > 0,$$

so that $Y^t |e(k)|$ can not be zero unless $e(k)$ is zero. Thus, the sequence $\|e(1)\|^2, \|e(2)\|^2, \ldots$
is monotonically decreasing and must converge to some limiting value $\|e\|^2$. But for
convergence to take place, $Y^t |e(k)|$ must converge to zero, which implies that $|e(k)|$
and hence $e(k)$ must converge to zero. Since $b(k)$ starts out positive and never
decreases, it follows that $a(k)$ must converge to a separating vector. Moreover, by the
same argument used before, a solution must actually be obtained after a finite number
of steps.

In the nonseparable case, $e(k)$ can neither be zero nor converge to zero. It may
happen that $Y^t |e(k)| = 0$ at some step, which would provide proof of nonseparability.
However, it is also possible for the sequence of corrections to go on forever. In this
case, it again follows that the sequence $\|e(1)\|^2, \|e(2)\|^2, \ldots$ must converge to a limiting
value $\|e\|^2 \ne 0$, and that $Y^t |e(k)|$ must converge to zero. Thus, we again obtain
evidence of nonseparability in the nonseparable case.

Before closing this discussion, let us look briefly at the question of choosing $\eta$ and
$R$. The simplest choice for $R$ is the identity matrix, in which case $A = 2\eta I - \eta^2 Y^t Y$.
This matrix will be positive definite, thereby assuring convergence, if $0 < \eta < 2/\lambda_{max}$,
where $\lambda_{max}$ is the largest eigenvalue of $Y^t Y$. Since the trace of $Y^t Y$ is both the sum
of the eigenvalues of $Y^t Y$ and the sum of the squares of the elements of $Y$, one can
use the pessimistic bound $\lambda_{max} \le \sum_i \|y_i\|^2$ in selecting a value for $\eta$.

A more interesting approach is to change $\eta$ at each step, selecting that value that
maximizes $\|e(k)\|^2 - \|e(k+1)\|^2$. Equations 91 & 92 give

$$\|e(k)\|^2 - \|e(k+1)\|^2 = |e(k)|^t\, Y \left( 2\eta R - \eta^2 R Y^t Y R \right) Y^t |e(k)|. \tag{93}$$

By differentiating with respect to $\eta$, we obtain the optimal value

$$\eta(k) = \frac{|e(k)|^t\, Y R Y^t\, |e(k)|}{|e(k)|^t\, Y R Y^t Y R Y^t\, |e(k)|}, \tag{94}$$

which, for $R = I$, simplifies to

$$\eta(k) = \frac{\| Y^t |e(k)| \|^2}{\| Y Y^t |e(k)| \|^2}. \tag{95}$$
This same approach can also be used to select the matrix $R$. By replacing $R$ in Eq. 93
by the symmetric matrix $R + \delta R$ and neglecting second-order terms, we obtain

$$\delta \left( \|e(k)\|^2 - \|e(k+1)\|^2 \right) = \eta\, |e(k)|^t\, Y \left[ \delta R\, (I - \eta Y^t Y R) + (I - \eta R Y^t Y)\, \delta R \right] Y^t |e(k)|.$$

Thus, the decrease in the squared error vector is maximized by the choice

$$R = \frac{1}{\eta} (Y^t Y)^{-1} \tag{96}$$

and since $\eta R Y^t = Y^\dagger$, the descent algorithm becomes virtually identical with the
original Ho-Kashyap algorithm.
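
The pseudoinverse-free rule of Eq. 90 is simple enough to sketch directly. The Python fragment below takes $R = I$ and the fixed step $\eta = 1/\sum_i \|y_i\|^2$, which lies inside the pessimistic bound on $\eta$ discussed above. The function names, the zero starting weight vector, the unit starting margin and the toy data are assumptions for this example only. It records $\|e(k)\|^2$ at each step so that the monotone decrease promised by Eq. 91 can be observed.

```python
def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def descent_without_pinv(Y, steps=200):
    """Iterate Eq. 90 with R = I and a fixed eta = 1 / trace(Y^t Y)."""
    n, d = len(Y), len(Y[0])
    Yt = [[Y[i][j] for i in range(n)] for j in range(d)]
    eta = 1.0 / sum(v * v for row in Y for v in row)   # 1 / sum_i ||y_i||^2
    a, b = [0.0] * d, [1.0] * n                        # a(1) arbitrary, b(1) > 0
    sq_errors = []
    for _ in range(steps):
        e = [s - bi for s, bi in zip(matvec(Y, a), b)]
        sq_errors.append(sum(ei * ei for ei in e))
        abs_e = [abs(ei) for ei in e]
        b = [bi + ei + ae for bi, ei, ae in zip(b, e, abs_e)]           # b + (e + |e|)
        a = [ai + eta * g for ai, g in zip(a, matvec(Yt, abs_e))]       # a + eta Y^t |e|
    return a, b, sq_errors


# Same separable toy set as before: class-2 samples are sign-flipped.
Y_toy = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [-1.0, 0.0, 0.0], [-1.0, -1.0, 0.0]]
a_fin, b_fin, sq_errors = descent_without_pinv(Y_toy)
print(sq_errors[:3])
```

A fixed $\eta$ chosen from the trace bound is safe but conservative; the per-step value of Eq. 95 would normally give larger decreases at the cost of an extra matrix-vector product.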


5.10 Linear Programming Algorithms 


5.10.1 Linear Programming 


The Perceptron, relaxation and Ho-Kashyap procedures are basically gradient de- 
scent procedures for solving simultaneous linear inequalities. Linear programming 
techniques are procedures for maximizing or minimizing linear functions subject to 
linear equality or inequality constraints. This at once suggests that one might be able 
to solve linear inequalities by using them as the constraints in a suitable linear pro- 
gramming problem. In this section we shall consider two of several ways that this can 
be done. The reader need have no knowledge of linear programming to understand 
these formulations, though such knowledge would certainly be useful in applying the 
techniques. 

A classical linear programming problem can be stated as follows: Find a vector
$u = (u_1, \ldots, u_m)^t$ that minimizes the linear (scalar) objective function

$$z = \alpha^t u \tag{97}$$

subject to the constraint

$$Au \ge \beta, \tag{98}$$

where $\alpha$ is an $m$-by-1 cost vector, $\beta$ is an $l$-by-1 vector, and $A$ is an $l$-by-$m$ matrix.
The simplex algorithm is the classical iterative procedure for solving this problem
(Fig. 5.18). For technical reasons, it requires the imposition of one more constraint,
viz., $u \ge 0$.

Figure 5.18: Surfaces of constant $z = \alpha^t u$ are shown in gray, while constraints of the
form $Au \ge \beta$ are shown in red. The simplex algorithm finds an extremum of $z$ given
the constraints, i.e., where the gray plane intersects the red at a single point.

If we think of $u$ as being the weight vector $a$, this constraint is unacceptable, since
in most cases the solution vector will have both positive and negative components.
However, suppose that we write

$$a \equiv a^+ - a^-, \tag{99}$$

where

$$a^+ \equiv \frac{1}{2} (|a| + a) \tag{100}$$

and

$$a^- \equiv \frac{1}{2} (|a| - a). \tag{101}$$

Then both $a^+$ and $a^-$ are nonnegative, and by identifying the components of $u$ with
the components of $a^+$ and $a^-$, for example, we can accept the constraint $u \ge 0$.
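
The decomposition of a weight vector into two nonnegative parts is easy to check on a small case. The following Python sketch uses a made-up vector and a hypothetical helper `split_signed`; it simply evaluates the two defining half-sums componentwise.

```python
def split_signed(a):
    """Return (a_plus, a_minus) with a = a_plus - a_minus and both parts nonnegative."""
    a_plus = [0.5 * (abs(ai) + ai) for ai in a]
    a_minus = [0.5 * (abs(ai) - ai) for ai in a]
    return a_plus, a_minus


a = [1.5, -2.0, 0.0]                     # illustrative weight vector
a_plus, a_minus = split_signed(a)
rebuilt = [p - m for p, m in zip(a_plus, a_minus)]
print(a_plus, a_minus, rebuilt)
```

Each component of $a$ appears in exactly one of the two parts, so the number of variables handed to the simplex algorithm doubles while every variable stays nonnegative.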


5.10.2 The Linearly Separable Case 


Suppose that we have a set of $n$ samples $y_1, \ldots, y_n$ and we want a weight vector $a$ that
satisfies $a^t y_i \ge b_i > 0$ for all $i$. How can we formulate this as a linear programming
problem? One approach is to introduce what is called an artificial variable $\tau \ge 0$ by
writing

$$a^t y_i + \tau \ge b_i.$$

If $\tau$ is sufficiently large, there is no problem in satisfying these constraints; for example,
they are satisfied if $a = 0$ and $\tau = \max_i b_i$.* However, this hardly solves our original
problem. What we want is a solution with $\tau = 0$, which is the smallest value $\tau$ can
have and still satisfy $\tau \ge 0$. Thus, we are led to consider the following problem:
Minimize $\tau$ over all values of $\tau$ and $a$ that satisfy the conditions $a^t y_i + \tau \ge b_i$ and $\tau \ge 0$.
If the answer is zero, the samples are linearly separable, and we have a solution. If the
answer is positive, there is no separating vector, but we have proof that the samples
are nonseparable.

* In the terminology of linear programming, any solution satisfying the constraints is called a feasible
solution. A feasible solution for which the number of nonzero variables does not exceed the number
of constraints (not counting the simplex requirement for nonnegative variables) is called a basic
feasible solution. Thus, the solution $a = 0$ and $\tau = \max_i b_i$ is a basic feasible solution. Possession
of such a solution simplifies the application of the simplex algorithm.

Formally, our problem is to find a vector $u$ that minimizes the objective function
$z = \alpha^t u$ subject to the constraints $Au \ge \beta$ and $u \ge 0$, where

$$
A = \begin{bmatrix} y_1^t & -y_1^t & 1 \\ y_2^t & -y_2^t & 1 \\ \vdots & \vdots & \vdots \\ y_n^t & -y_n^t & 1 \end{bmatrix}, \quad
u = \begin{bmatrix} a^+ \\ a^- \\ \tau \end{bmatrix}, \quad
\alpha = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \quad
\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
$$

Thus, the linear programming problem involves $m = 2d + 1$ variables and $l = n$
constraints, plus the simplex algorithm constraints $u \ge 0$. The simplex algorithm will
find the minimum value of the objective function $z = \alpha^t u = \tau$ in a finite number
of steps, and will exhibit a vector $\hat{u}$ yielding that value. If the samples are linearly
separable, the minimum value of $\tau$ will be zero, and a solution vector $\hat{a}$ can be obtained
from $\hat{u}$. If the samples are not separable, the minimum value of $\tau$ will be positive.
The resulting $\hat{u}$ is usually not very useful as an approximate solution, but at least one
obtains proof of nonseparability.
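
To make the bookkeeping of this formulation concrete, the Python sketch below assembles $A$, $\alpha$ and $\beta$ for a small sample set and checks two candidate points for feasibility. It does not run the simplex algorithm; the function names, the toy samples and the candidate weight vector are assumptions chosen for illustration, and any linear programming package could be given the same arrays.

```python
def build_separable_lp(Y, b):
    """Return (A, alpha, beta) for: minimize alpha^t u subject to A u >= beta, u >= 0,
    where u stacks a_plus (d entries), a_minus (d entries) and tau (1 entry)."""
    d = len(Y[0])
    A = [list(y) + [-yi for yi in y] + [1.0] for y in Y]
    alpha = [0.0] * (2 * d) + [1.0]
    beta = list(b)
    return A, alpha, beta


def is_feasible(A, beta, u):
    """Check u >= 0 and A u >= beta componentwise."""
    rows_ok = all(sum(aij * uj for aij, uj in zip(row, u)) >= bi for row, bi in zip(A, beta))
    return rows_ok and all(ui >= 0 for ui in u)


def objective(alpha, u):
    return sum(ai * ui for ai, ui in zip(alpha, u))


# Toy samples (class-2 rows sign-flipped) and unit margins.
Y = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [-1.0, 0.0, 0.0], [-1.0, -1.0, 0.0]]
b = [1.0, 1.0, 1.0, 1.0]
A, alpha, beta = build_separable_lp(Y, b)

# Basic feasible starting point: a = 0, tau = max_i b_i.
u_start = [0.0] * 6 + [max(b)]

# A hand-picked separating vector a = (-3, 1, 1.5), split into a_plus and a_minus, with tau = 0.
u_sep = [0.0, 1.0, 1.5, 3.0, 0.0, 0.0, 0.0]

print(is_feasible(A, beta, u_start), objective(alpha, u_start))
print(is_feasible(A, beta, u_sep), objective(alpha, u_sep))
```

Both points are feasible, and the second attains $\tau = 0$, which is the smallest value allowed by $\tau \ge 0$; this is what the minimization would report for a separable set.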


5.10.3 Minimizing the Perceptron Criterion Function 


In the vast majority of pattern classification applications we cannot assume that the 
samples are linearly separable. In particular, when the patterns are not separable, 
one still wants to obtain a weight vector that classifies as many samples correctly as 
possible. Unfortunately, the number of errors is not a linear function of the compo- 
nents of the weight vector, and its minimization is not a linear programming problem. 
However, it turns out that the problem of minimizing the Perceptron criterion func- 
tion can be posed as a problem in linear programming. Since minimization of this 
criterion function yields a separating vector in the separable case and a reasonable 
solution in the nonseparable case, this approach is quite attractive. 
Recall from Sect. ?? that the basic Perceptron criterion function is given by

$$J_p(a) = \sum_{y \in \mathcal{Y}} (-a^t y), \tag{102}$$

where $\mathcal{Y}(a)$ is the set of training samples misclassified by $a$. To avoid the useless
solution $a = 0$, we introduce a positive margin vector $b$ and write

$$J_p'(a) = \sum_{y_i \in \mathcal{Y}'} (b_i - a^t y_i), \tag{103}$$

where $y_i \in \mathcal{Y}'$ if $a^t y_i \le b_i$. Clearly, $J_p'$ is a piecewise-linear function of $a$, not a
linear function, and linear programming techniques are not immediately applicable.
However, by introducing $n$ artificial variables and their constraints we can construct
an equivalent linear objective function. Consider the problem of finding vectors $a$ and
$\tau$ that minimize the linear function

$$z = \sum_{i=1}^{n} \tau_i$$

subject to the constraints

$$\tau_i \ge 0 \quad \text{and} \quad \tau_i \ge b_i - a^t y_i.$$

Of course for any fixed value of $a$, the minimum value of $z$ is exactly equal to $J_p'(a)$,
since under these constraints the best we can do is to take $\tau_i = \max[0, b_i - a^t y_i]$. If
we minimize $z$ over $\tau$ and $a$, we shall obtain the minimum possible value of $J_p'(a)$.
Thus, we have converted the problem of minimizing $J_p'(a)$ to one of minimizing a
linear function $z$ subject to linear inequality constraints. Letting $u_n$ denote the
$n$-dimensional vector whose components are all ones, we obtain the following problem
with $m = 2d + n$ variables and $l = n$ constraints: Minimize $\alpha^t u$ subject to $Au \ge \beta$
and $u \ge 0$, where

$$
A = \begin{bmatrix} y_1^t & -y_1^t & 1 & 0 & \cdots & 0 \\ y_2^t & -y_2^t & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ y_n^t & -y_n^t & 0 & 0 & \cdots & 1 \end{bmatrix}, \quad
u = \begin{bmatrix} a^+ \\ a^- \\ \tau \end{bmatrix}, \quad
\alpha = \begin{bmatrix} 0 \\ 0 \\ u_n \end{bmatrix}, \quad
\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
$$

The choice $a = 0$ and $\tau_i = b_i$ provides a basic feasible solution to start the simplex
algorithm, and the simplex algorithm will provide an $\hat{a}$ minimizing $J_p'(a)$ in a finite
number of steps.
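
The claim that, for a fixed weight vector, the smallest feasible artificial variables sum to the margin form of the Perceptron criterion can be checked on a small case. The Python sketch below is only an illustration; the function names, the samples and the two trial weight vectors are invented for the example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def perceptron_margin_criterion(Y, b, a):
    """Sum of (b_i - a^t y_i) over the samples with a^t y_i <= b_i (Eq. 103)."""
    return sum(bi - dot(a, y) for y, bi in zip(Y, b) if dot(a, y) <= bi)


def smallest_feasible_tau(Y, b, a):
    """Smallest tau_i satisfying tau_i >= 0 and tau_i >= b_i - a^t y_i for fixed a."""
    return [max(0.0, bi - dot(a, y)) for y, bi in zip(Y, b)]


Y = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [-1.0, 0.0, 0.0], [-1.0, -1.0, 0.0]]
b = [1.0, 1.0, 1.0, 1.0]

a_trial = [0.0, 1.0, 0.0]                  # misses the margin on the last two samples
tau_trial = smallest_feasible_tau(Y, b, a_trial)
z_trial = sum(tau_trial)

a_zero = [0.0, 0.0, 0.0]                   # the basic feasible start: tau_i = b_i
tau_zero = smallest_feasible_tau(Y, b, a_zero)

print(tau_trial, z_trial, perceptron_margin_criterion(Y, b, a_trial))
```

For the trial vector the two quantities agree, and for $a = 0$ the smallest feasible $\tau_i$ are exactly the margins $b_i$, matching the starting point suggested for the simplex algorithm.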

We have shown two ways to formulate the problem of finding a linear discriminant 
function as a problem in linear programming. There are other possible formulations, 
the ones involving the so-called dual problem being of particular interest from a com- 
putational standpoint. Generally speaking, methods such as the simplex method are 
merely sophisticated gradient descent methods for extremizing linear functions sub- 
ject to linear constraints. The coding of a linear programming algorithm is usually 
more complicated than the coding of the simpler descent procedures we described ear- 
lier, and these descent procedures generalize naturally to multilayer neural networks. 
However, general purpose linear programming packages can often be used directly or 
modified appropriately with relatively little effort. When this can be done, one can 
secure the advantage of guaranteed convergence on both separable and nonseparable 
problems. 

The various algorithms for finding linear discriminant functions presented in this 
chapter are summarized in Table 5.1. It is natural to ask which one is best, but none 
uniformly dominates or is uniformly dominated by all others. The choice depends 
upon such considerations as desired characteristics, ease of programming, the number 
of samples, and the dimensionality of the samples. If a linear discriminant function 
can yield a low error rate, any of these procedures, intelligently applied, can provide 
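good performance.

Table 5.1: Descent Procedures for Obtaining Linear Discriminant Functions

Fixed Increment
    Criterion:  $J_p = \sum_{y \in \mathcal{Y}} (-a^t y)$
    Algorithm:  $a(k+1) = a(k) + y^k$
    Conditions: $a^t(k) y^k \le 0$
    Remarks (truncated): Finite... linear... soluti... a(k) t...

Variable Increment
    Algorithm:  $a(k+1) = a(k) + \eta(k)\, y^k$
    Conditions: $a^t(k) y^k \le b$
    Remarks (truncated): Conv... separ... with... Finite... 0<a...

Relaxation
    Algorithm:  $a(k+1) = a(k) + \eta\, \frac{b - a^t(k) y^k}{\|y^k\|^2}\, y^k$
    Conditions: $a^t(k) y^k \le b$
    Remarks (truncated): Conv... separ... with... finite... soluti...

Widrow-Hoff (LMS)
    Algorithm:  $a(k+1) = a(k) + \eta(k)\, (b_k - a^t(k) y^k)\, y^k$
    Remarks (truncated): Tend:... minir...

Stochastic Approx.
    Criterion:  $J_m = \mathcal{E}[(a^t y - z)^2]$
    Algorithm:  $a(k+1) = a(k) + \eta(k)\, (z_k - a^t(k) y^k)\, y^k$
                $a(k+1) = a(k) + R(k)\, (z(k) - a^t(k) y^k)\, y^k$
    Remarks (truncated): Invol... numb... drawı... verge... toas... mizin... vides... imati... discri...

Pseudoinverse
    Criterion:  $J_s = \|Ya - b\|^2$
    Algorithm:  $a = Y^\dagger b$
    Remarks (truncated): Class... speci;... yield... discri... appre... Bayes...

Ho-Kashyap
    Criterion:  $J_s = \|Ya - b\|^2$
    Algorithm:  $b(k+1) = b(k) + \eta\, (e(k) + |e(k)|)$
                $e(k) = Ya(k) - b(k)$
                $a(k) = Y^\dagger b(k)$
    Remarks (truncated): a(k).... for ea... verge... separ... but e... samp... non-s...

    Algorithm:  $b(k+1) = b(k) + (e(k) + |e(k)|)$
                $a(k+1) = a(k) + \eta\, R\, Y^t |e(k)|$
    Conditions: $\eta(k) = \frac{|e(k)|^t\, Y R Y^t\, |e(k)|}{|e(k)|^t\, Y R Y^t Y R Y^t\, |e(k)|}$ is optimum;
                $R$ sym., pos. def.; $b(1) > 0$
    Remarks (truncated): Finite... linear... if Y+... e(k) :... are ni...

Linear Programming
    Algorithm:  Simplex algorithm
    Conditions: $a^t y_i + \tau \ge b_i$; $b > 0$
    Remarks (truncated): Finite... both... nonse... usefu... if sep...

    Algorithm:  Simplex algorithm
    Conditions: $a^t y_i + \tau_i \ge b_i$; $b > 0$
    Remarks (truncated): Finite... both... nonse... usefu... if sep...


5.11 *Support Vector Machines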


We have seen how to train linear machines with margins. Support Vector Machines
(SVMs) are motivated by many of the same considerations, but rely on preprocessing
the data to represent patterns in a high dimension, typically much higher than the
original feature space. With an appropriate nonlinear mapping $\varphi()$ to a sufficiently
high dimension, data from two categories can always be separated by a hyperplane
(Problem 27). Here we assume each pattern $\mathbf{x}_k$ has been transformed to $\mathbf{y}_k = \varphi(\mathbf{x}_k)$;
we return to the choice of $\varphi()$ below. For each of the n patterns, k = 1, 2, ..., n, we let
$z_k = \pm 1$, according to whether pattern k is in $\omega_1$ or $\omega_2$. A linear discriminant in an
augmented y space is


$$g(\mathbf{y}) = \mathbf{a}^t\mathbf{y}, \tag{104}$$


where both the weight vector and the transformed pattern vector are augmented (by
$a_0 = w_0$ and $y_0 = 1$, respectively). Thus a separating hyperplane ensures


$$z_k\, g(\mathbf{y}_k) \ge 1, \qquad k = 1, \ldots, n, \tag{105}$$


much as was shown in Fig. 5.8. 

In Sect. ??, the margin was any positive distance from the decision hyperplane.
The goal in training a Support Vector Machine is to find the separating hyperplane
with the largest margin; we expect that the larger the margin, the better the generalization
of the classifier. As illustrated in Fig. 5.2, the distance from any hyperplane to a
(transformed) pattern y is $|g(\mathbf{y})|/\|\mathbf{a}\|$, and assuming that a positive margin b exists,
Eq. 105 implies


$$\frac{z_k\, g(\mathbf{y}_k)}{\|\mathbf{a}\|} \ge b, \qquad k = 1, \ldots, n; \tag{106}$$


the goal is to find the weight vector a that maximizes b. Of course, the solution
vector can be scaled arbitrarily and still preserve the hyperplane, and thus to ensure
uniqueness we impose the constraint $b\,\|\mathbf{a}\| = 1$; that is, we demand that the solution to
Eqs. 104 & 105 also minimize $\|\mathbf{a}\|^2$.
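
The link between maximizing the margin and minimizing the weight norm can be written out in one line. The following is only a worked restatement of Eqs. 105 and 106 under the normalization $b\,\|\mathbf{a}\| = 1$; the factor $\tfrac{1}{2}$ is a convenience that anticipates the functional used for training.

```latex
% With the normalization b ||a|| = 1, Eq. 106 reduces to Eq. 105:
\frac{z_k\, g(\mathbf{y}_k)}{\|\mathbf{a}\|} \ge b
\;\Longleftrightarrow\;
z_k\, \mathbf{a}^t \mathbf{y}_k \ge b\,\|\mathbf{a}\| = 1,
\qquad k = 1,\ldots,n.
% The margin is then b = 1/||a||, so
\max_{\mathbf{a}}\, b
\;\Longleftrightarrow\;
\min_{\mathbf{a}}\, \|\mathbf{a}\|
\;\Longleftrightarrow\;
\min_{\mathbf{a}}\, \tfrac{1}{2}\|\mathbf{a}\|^2
\quad \text{subject to } z_k\, \mathbf{a}^t \mathbf{y}_k \ge 1 .
```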

The support vectors are the (transformed) training patterns for which Eq. 105 rep- 
resents an equality — that is, the support vectors are (equally) close to the hyperplane 
(Fig. 5.19). The support vectors are the training samples that define the optimal sepa- 
rating hyperplane and are the most difficult patterns to classify. Informally speaking, 
they are the patterns most informative for the classification task. 

If $N_s$ denotes the total number of support vectors, then for n training patterns
the expected value of the generalization error rate is bounded, according to


$$\mathcal{E}_n[\text{error rate}] \le \frac{\mathcal{E}_n[N_s]}{n}, \tag{107}$$


where the expectation is over all training sets of size n drawn from the (stationary)
distributions describing the categories. This bound is independent of the dimensionality
of the space of transformed vectors, determined by $\varphi()$. We will return to this
equation in Chap. ??, but for now we can understand it informally by means of
the leave-one-out bound. Suppose we have n points in the training set, and train a
Support Vector Machine on n - 1 of them, and test on the single remaining point.
If that remaining point happens to be a support vector for the full n-sample case,
then there may be an error; otherwise, there will not. Note that if we can find a
transformation $\varphi()$ that separates the data well, so that the expected number of support
vectors is small, then Eq. 107 shows that the expected error rate will be lower.


Figure 5.19: Training a Support Vector Machine consists of finding the optimal hyperplane,
i.e., the one with the maximum distance from the nearest training patterns.
The support vectors are those (nearest) patterns, a distance b from the hyperplane.
The three support vectors are shown as solid dots.


5.11.1 SVM training 


We now turn to the problem of training an SVM. The first step is, of course, to choose
the nonlinear $\varphi$-functions that map the input to a higher dimensional space. Often
this choice will be informed by the designer’s knowledge of the problem domain. In
the absence of such information, one might choose to use polynomials, Gaussians or
yet other basis functions. The dimensionality of the mapped space can be arbitrarily
high (though in practice it may be limited by computational resources).

We begin by recasting the problem of minimizing the magnitude of the weight
vector constrained by the separation into an unconstrained problem by the method
of Lagrange undetermined multipliers. Thus from Eq. 106 and our goal of minimizing
$\|\mathbf{a}\|$, we construct the functional


$$L(\mathbf{a}, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{a}\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k \mathbf{a}^t\mathbf{y}_k - 1 \right], \tag{108}$$


and seek to minimize L() with respect to the weight vector a, and maximize it with
respect to the undetermined multipliers $\alpha_k \ge 0$. The last term in Eq. 108 expresses
the goal of classifying the points correctly. It can be shown using the so-called Kuhn-Tucker
construction (Problem 30) (also associated with Karush, whose 1939 thesis
addressed the same problem) that this optimization can be reformulated as maximizing


$$L(\boldsymbol{\alpha}) = \sum_{k=1}^{n} \alpha_k - \frac{1}{2}\sum_{k,j}^{n} \alpha_k \alpha_j z_k z_j\, \mathbf{y}_j^t\mathbf{y}_k, \tag{109}$$


subject to the constraints

$$\sum_{k=1}^{n} z_k \alpha_k = 0, \qquad \alpha_k \ge 0, \quad k = 1, \ldots, n, \tag{110}$$


given the training data. While these equations can be solved using quadratic pro- 
gramming, a number of alternate schemes have been devised (cf. Bibliography). 
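
The step from Eq. 108 to Eq. 109 can be sketched as follows. This is a condensed version of the argument that Problem 30 asks for, not a full proof: it uses only the stationarity of L with respect to the weight vector and takes the constraints of Eq. 110 as given.

```latex
% Stationarity of Eq. 108 with respect to the weight vector:
\frac{\partial L}{\partial \mathbf{a}}
  = \mathbf{a} - \sum_{k=1}^{n} \alpha_k z_k \mathbf{y}_k = \mathbf{0}
\quad\Longrightarrow\quad
\mathbf{a} = \sum_{k=1}^{n} \alpha_k z_k \mathbf{y}_k .
% Substituting back, both a-dependent terms reduce to the same double sum:
\tfrac{1}{2}\|\mathbf{a}\|^2
  = \tfrac{1}{2}\sum_{k,j} \alpha_k \alpha_j z_k z_j\, \mathbf{y}_j^t \mathbf{y}_k ,
\qquad
\sum_{k} \alpha_k z_k\, \mathbf{a}^t \mathbf{y}_k
  = \sum_{k,j} \alpha_k \alpha_j z_k z_j\, \mathbf{y}_j^t \mathbf{y}_k ,
% so that
L(\boldsymbol{\alpha})
  = \sum_{k=1}^{n} \alpha_k
  - \tfrac{1}{2}\sum_{k,j} \alpha_k \alpha_j z_k z_j\, \mathbf{y}_j^t \mathbf{y}_k .
```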


Example 2: SVM for the XOR problem


The exclusive-OR is the simplest problem that cannot be solved using a linear
discriminant operating directly on the features. The points k = 1, 3 at $\mathbf{x} = (1, 1)^t$
and $(-1, -1)^t$ are in category $\omega_1$ (red in the figure), while k = 2, 4 at $\mathbf{x} = (1, -1)^t$
and $(-1, 1)^t$ are in $\omega_2$ (black in the figure). Following the approach of Support Vector
Machines, we preprocess the features to map them to a higher dimension space where
they can be linearly separated. While many $\varphi$-functions could be used, here we use
the simplest expansion up to second order: $1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2$ and $x_2^2$, where
the $\sqrt{2}$ is convenient for normalization.


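We seek to maximize Eq. 109,


$$\sum_{k=1}^{4} \alpha_k - \frac{1}{2}\sum_{k,j}^{n} \alpha_k \alpha_j z_k z_j\, \mathbf{y}_j^t\mathbf{y}_k$$


subject to the constraints (Eq. 110)


$$\alpha_1 - \alpha_2 + \alpha_3 - \alpha_4 = 0$$
$$0 \le \alpha_k, \qquad k = 1, 2, 3, 4.$$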


It is clear from the symmetry of the problem that $\alpha_1 = \alpha_3$ and that $\alpha_2 = \alpha_4$ at the
solution. While we could use iterative gradient descent as described in Sect. 5.9, for
this small problem we can use analytic techniques instead. The solution is $\alpha_k = 1/8$,
for k = 1, 2, 3, 4, and from the last term in Eq. 108 this implies that all four training
patterns are support vectors, an unusual case due to the highly symmetric nature
of the XOR problem.

The final discriminant function is $g(\mathbf{x}) = g(x_1, x_2) = x_1 x_2$, and the decision
hyperplane is defined by g = 0, which properly classifies all training patterns. The
margin is easily computed from the solution $\|\mathbf{a}\|$ and is found to be $b = 1/\|\mathbf{a}\| = \sqrt{2}$.
The figure at the right shows the margin projected into two dimensions of the five
dimensional transformed space. Problem 28 asks you to consider this margin as viewed
in other two-dimensional projected sub-spaces.
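
The numbers quoted in this example can be checked with a few lines of arithmetic. The sketch below is an illustration only: it assumes the weight vector is recovered from the multipliers as a sum of alpha_k z_k y_k (the stationarity condition behind Eq. 109), and the names `phi`, `dot`, `g` and `dual` are hypothetical helpers, not anything defined in the text.

```python
import math

def phi(x1, x2):
    """Second-order expansion of Example 2 (hypothetical helper name)."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Training patterns and targets: z = +1 for omega_1, z = -1 for omega_2.
X = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
z = [+1, -1, +1, -1]
alpha = [1.0 / 8.0] * 4          # solution quoted in the example

Y = [phi(x1, x2) for x1, x2 in X]

# Assumed relation: a = sum_k alpha_k z_k y_k.
a = [sum(alpha[k] * z[k] * Y[k][i] for k in range(4)) for i in range(6)]

def g(x1, x2):
    """Discriminant a^t phi(x); it should reduce to x1 * x2."""
    return dot(a, phi(x1, x2))

def dual(al):
    """Value of the dual objective of Eq. 109 for multipliers al."""
    n = len(al)
    quad = sum(al[k] * al[j] * z[k] * z[j] * dot(Y[j], Y[k])
               for k in range(n) for j in range(n))
    return sum(al) - 0.5 * quad

margin = 1.0 / math.sqrt(dot(a, a))   # b = 1 / ||a||
```

Evaluating `z[k] * g(*X[k])` for each training pattern gives 1 in every case, which is the equality case of Eq. 105 and is why all four patterns are support vectors.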


An important benefit of the Support Vector Machine approach is that the com- 
plexity of the resulting classifier is characterized by the number of support vectors — 
independent of the dimensionality of the transformed space. This

The XOR problem in the original $x_1$-$x_2$ feature space is shown at the left; the two
red patterns are in category $\omega_1$ and the two black ones in $\omega_2$. These four training
patterns x are mapped to a six-dimensional space by $1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2, x_1^2$ and
$x_2^2$. In this space, the optimal hyperplane is found to be $g(x_1, x_2) = x_1 x_2 = 0$ and the
margin is $b = \sqrt{2}$. A two-dimensional projection of this space is shown at the right.
The hyperplanes through the support vectors are $\sqrt{2}x_1x_2 = \pm\sqrt{2}$, and correspond to
the hyperbolas $x_1 x_2 = \pm 1$ in the original feature space, as shown.


5.12 Multicategory Generalizations 


5.12.1 Kesler’s Construction 


There is no uniform way to extend all of the two-category procedures we have discussed 
to the multicategory case. In Sect. 5.2.2 we defined a multicategory classifier called a 
linear machine which classifies a pattern by computing c linear discriminant functions 


$$g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}, \qquad i = 1, \ldots, c,$$


and assigning x to the category corresponding to the largest discriminant. This is 
a natural generalization for the multiclass case, particularly in view of the results 
of Chap. ?? for the multivariate normal problem. It can be extended simply to 
generalized linear discriminant functions by letting y(x) be a d-dimensional vector of 
functions of x, and by writing 


$$g_i(\mathbf{x}) = \mathbf{a}_i^t\mathbf{y}, \qquad i = 1, \ldots, c, \tag{111}$$


where again x is assigned to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$.

The generalization of our procedures from a two-category linear classifier to a
multicategory linear machine is simplest in the linearly-separable case. Suppose that
we have a set of labelled samples $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, with $n_1$ in the subset $\mathcal{Y}_1$ labelled $\omega_1$,
$n_2$ in the subset $\mathcal{Y}_2$ labelled $\omega_2$, ..., and $n_c$ in the subset $\mathcal{Y}_c$ labelled $\omega_c$. We say that
this set is linearly separable if there exists a linear machine that classifies all of them
correctly. That is, if these samples are linearly separable, then there exists a set of
weight vectors $\hat{\mathbf{a}}_1, \ldots, \hat{\mathbf{a}}_c$ such that if $\mathbf{y}_k \in \mathcal{Y}_i$, then


$$\hat{\mathbf{a}}_i^t\mathbf{y}_k > \hat{\mathbf{a}}_j^t\mathbf{y}_k \tag{112}$$
for all $j \ne i$.
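One of the pleasant things about this definition is that it is possible to manipulate
these inequalities and reduce the multicategory problem to the two-category case.
Suppose for the moment that $\mathbf{y} \in \mathcal{Y}_1$, so that Eq. 112 becomes

$$\hat{\mathbf{a}}_1^t\mathbf{y} - \hat{\mathbf{a}}_j^t\mathbf{y} > 0, \qquad j = 2, \ldots, c. \tag{113}$$


This set of c - 1 inequalities can be thought of as requiring that the cd-dimensional
weight vector


$$\hat{\boldsymbol{\alpha}} = \begin{pmatrix} \hat{\mathbf{a}}_1 \\ \hat{\mathbf{a}}_2 \\ \vdots \\ \hat{\mathbf{a}}_c \end{pmatrix}$$

correctly classify all c - 1 of the cd-dimensional samples

$$\boldsymbol{\eta}_{12} = \begin{pmatrix} \mathbf{y} \\ -\mathbf{y} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{pmatrix}, \quad
\boldsymbol{\eta}_{13} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \\ -\mathbf{y} \\ \vdots \\ \mathbf{0} \end{pmatrix}, \quad \ldots, \quad
\boldsymbol{\eta}_{1c} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \\ \mathbf{0} \\ \vdots \\ -\mathbf{y} \end{pmatrix}.$$


In other words, each $\boldsymbol{\eta}_{1j}$ corresponds to “normalizing” the patterns in $\omega_1$ and $\omega_j$.
More generally, if $\mathbf{y} \in \mathcal{Y}_i$, we construct c - 1 training samples $\boldsymbol{\eta}_{ij}$, each cd-dimensional, by
partitioning $\boldsymbol{\eta}_{ij}$ into c subvectors of dimension d, with the ith subvector being y, the
jth being -y, and all others being zero. Clearly, if $\hat{\boldsymbol{\alpha}}^t\boldsymbol{\eta}_{ij} > 0$ for $j \ne i$, then the
linear machine corresponding to the components of $\hat{\boldsymbol{\alpha}}$ classifies y correctly.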

This so-called Kesler construction multiplies the dimensionality of the data by c 
and the number of samples by c— 1, which does not make its direct use attractive. 
Its importance resides in the fact that it allows us to convert many multicategory 
error-correction procedures to two-category procedures for the purpose of obtaining 
a convergence proof. 
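
Kesler's construction is easy to state in code. The sketch below is illustrative only: the function names `kesler_samples`, `stack_weights` and `dot` are hypothetical, categories are indexed from 0 rather than 1, and vectors are plain Python lists.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kesler_samples(y, i, c):
    """Return the c-1 vectors eta_ij (j != i) for a sample y from category i.

    Each eta_ij has c*d components, split into c blocks of length d:
    block i holds y, block j holds -y, and every other block is zero.
    Categories are 0-based here (an assumption of this sketch).
    """
    d = len(y)
    etas = []
    for j in range(c):
        if j == i:
            continue
        eta = [0.0] * (c * d)
        eta[i * d:(i + 1) * d] = list(y)
        eta[j * d:(j + 1) * d] = [-v for v in y]
        etas.append(eta)
    return etas

def stack_weights(weights):
    """Stack the c weight vectors a_1..a_c into one cd-dimensional vector."""
    return [w for a_i in weights for w in a_i]
```

With these definitions, `dot(stack_weights(weights), eta_ij)` equals the difference between the ith and jth discriminants evaluated at y, so requiring every such inner product to be positive is the same as requiring the linear machine to classify y correctly.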


5.12.2 Convergence of the Fixed-Increment Rule 


We now use Kesler’s construction to prove convergence for a generalization of the
fixed-increment rule for a linear machine. Suppose that we have a set of n linearly-separable
samples $\mathbf{y}_1, \ldots, \mathbf{y}_n$, and we use them to form an infinite sequence in which
every sample appears infinitely often. Let $L_k$ denote the linear machine whose weight
vectors are $\mathbf{a}_1(k), \ldots, \mathbf{a}_c(k)$. Starting with an arbitrary initial linear machine $L_1$, we
want to use the sequence of samples to construct a sequence of linear machines that
converges to a solution machine, one that classifies all of the samples correctly. We
shall propose an error-correction rule in which weight changes are made if and only
if the present linear machine misclassifies a sample. Let $\mathbf{y}^k$ denote the kth sample
requiring correction, and suppose that $\mathbf{y}^k \in \mathcal{Y}_i$. Since $\mathbf{y}^k$ requires correction, there
must be at least one $j \ne i$ for which


$$\mathbf{a}_i^t(k)\,\mathbf{y}^k \le \mathbf{a}_j^t(k)\,\mathbf{y}^k. \tag{114}$$


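Then the fixed-increment rule for correcting $L_k$ is

$$\begin{aligned}
\mathbf{a}_i(k+1) &= \mathbf{a}_i(k) + \mathbf{y}^k \\
\mathbf{a}_j(k+1) &= \mathbf{a}_j(k) - \mathbf{y}^k \\
\mathbf{a}_l(k+1) &= \mathbf{a}_l(k), \qquad l \ne i \text{ and } l \ne j.
\end{aligned} \tag{115}$$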


That is, the weight vector for the desired category is incremented by the pattern, the 
weight vector for the incorrectly chosen category is decremented, and all other weights 
are left unchanged (Problem 33, Computer exercise 12). 
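
A minimal sketch of this rule follows, under stated assumptions: the samples are presented cyclically, the competing category j is taken to be the one with the largest discriminant (the text only requires some j satisfying Eq. 114), categories are indexed from 0, and `max_epochs` is a safety cap added here because convergence is only guaranteed for linearly separable data. All function and variable names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(weights, y):
    """Linear machine: index of the largest discriminant a_i^t y."""
    scores = [dot(a_i, y) for a_i in weights]
    return max(range(len(weights)), key=lambda i: scores[i])

def train_linear_machine(samples, labels, c, max_epochs=100):
    """Multicategory fixed-increment rule of Eq. 115.

    samples: augmented pattern vectors y; labels: category indices 0..c-1.
    Returns (weights, converged).
    """
    d = len(samples[0])
    weights = [[0.0] * d for _ in range(c)]
    for _ in range(max_epochs):
        corrected = False
        for y, i in zip(samples, labels):
            scores = [dot(a_j, y) for a_j in weights]
            # Strongest competing category j != i (a choice made in this sketch).
            j = max((m for m in range(c) if m != i), key=lambda m: scores[m])
            if scores[i] <= scores[j]:      # Eq. 114: y requires correction
                weights[i] = [w + v for w, v in zip(weights[i], y)]
                weights[j] = [w - v for w, v in zip(weights[j], y)]
                corrected = True
        if not corrected:                   # a full pass with no corrections
            return weights, True
    return weights, False

# Three augmented two-dimensional patterns, one per category (made-up data).
samples = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, -2.0, -2.0]]
labels = [0, 1, 2]
weights, converged = train_linear_machine(samples, labels, c=3)
```

Starting every weight vector at zero makes the first presentation of each sample a tie, which Eq. 114 counts as requiring correction; that is why the comparison uses `<=` rather than `<`.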

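We shall now show that this rule must lead to a solution machine after a finite
number of corrections. The proof is simple. For each linear machine $L_k$ there corresponds
a weight vector


$$\boldsymbol{\alpha}(k) = \begin{pmatrix} \mathbf{a}_1(k) \\ \vdots \\ \mathbf{a}_c(k) \end{pmatrix}.$$


For each sample $\mathbf{y} \in \mathcal{Y}_i$ there are c - 1 samples $\boldsymbol{\eta}_{ij}$ formed as described in Sect. ??.
In particular, corresponding to the vector $\mathbf{y}^k$ satisfying Eq. 114 there is a vector
$\boldsymbol{\eta}_{ij}^k$, with $\mathbf{y}^k$ as its ith subvector, $-\mathbf{y}^k$ as its jth subvector and zeros elsewhere,
satisfying


$$\boldsymbol{\alpha}^t(k)\,\boldsymbol{\eta}_{ij}^k \le 0.$$
Furthermore, the fixed-increment rule for correcting $L_k$ is the fixed-increment rule for
correcting $\boldsymbol{\alpha}(k)$, viz.,


$$\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \boldsymbol{\eta}_{ij}^k.$$


Thus, we have obtained a complete correspondence between the multicategory case
and the two-category case, in which the multicategory procedure produces a sequence
of samples $\boldsymbol{\eta}^1, \boldsymbol{\eta}^2, \ldots, \boldsymbol{\eta}^k, \ldots$ and a sequence of weight vectors $\boldsymbol{\alpha}(1), \boldsymbol{\alpha}(2), \ldots, \boldsymbol{\alpha}(k), \ldots$ By our
results for the two-category case, this latter sequence cannot be infinite, but must
terminate in a solution vector. Hence, the sequence $L_1, L_2, \ldots, L_k, \ldots$ must terminate
in a solution machine after a finite number of corrections.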

This use of Kesler’s construction to establish equivalences between multicategory 
and two-category procedures is a powerful theoretical tool. It can be used to extend 
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all of our results for the Perceptron and relaxation procedures to the multicategory 
case, and applies as well to the error-correction rules for the method of potential 
functions (Problem ??). Unfortunately, it is not as directly useful in generalizing the 
MSE or the linear programming approaches. 


5.12.3 Generalizations for MSE Procedures 


Perhaps the simplest way to obtain a natural generalization of the MSE procedures
to the multiclass case is to consider the problem as a set of c two-class problems. The
ith problem is to obtain a weight vector $\mathbf{a}_i$ that is a minimum-squared-error solution to
the equations


$$\begin{aligned}
\mathbf{a}_i^t\mathbf{y} &= 1 \quad \text{for all } \mathbf{y} \in \mathcal{Y}_i \\
\mathbf{a}_i^t\mathbf{y} &= -1 \quad \text{for all } \mathbf{y} \notin \mathcal{Y}_i.
\end{aligned}$$


In view of the results of Sect. 5.8.3, if the number of samples is very large we will obtain
a minimum mean-squared-error approximation to the Bayes discriminant function


$$P(\omega_i|\mathbf{x}) - P(\text{not } \omega_i|\mathbf{x}) = 2P(\omega_i|\mathbf{x}) - 1.$$


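This observation has two immediate consequences. First, it suggests a modification
in which we seek a weight vector $\mathbf{a}_i$ that is a minimum-squared-error solution to the
equations


$$\begin{aligned}
\mathbf{a}_i^t\mathbf{y} &= 1 \quad \text{for all } \mathbf{y} \in \mathcal{Y}_i \\
\mathbf{a}_i^t\mathbf{y} &= 0 \quad \text{for all } \mathbf{y} \notin \mathcal{Y}_i
\end{aligned} \tag{116}$$


so that $\mathbf{a}_i^t\mathbf{y}$ will be a minimum mean-squared-error approximation to $P(\omega_i|\mathbf{x})$. Second,
it justifies the use of the resulting discriminant functions in a linear machine, in which
we assign y to $\omega_i$ if $\mathbf{a}_i^t\mathbf{y} > \mathbf{a}_j^t\mathbf{y}$ for all $j \ne i$.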

The pseudoinverse solution to the multiclass MSE problem can be written in a
form analogous to the form for the two-class case. Let Y be the n-by-d matrix of
training samples, which we assume to be partitioned as


$$\mathbf{Y} = \begin{pmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \vdots \\ \mathbf{Y}_c \end{pmatrix}, \tag{117}$$


with the samples labelled $\omega_i$ comprising the rows of $\mathbf{Y}_i$. Similarly, let A be the d-by-c
matrix of weight vectors


$$\mathbf{A} = [\mathbf{a}_1 \ \mathbf{a}_2 \ \cdots \ \mathbf{a}_c], \tag{118}$$
and let B be the n-by-c matrix

$$\mathbf{B} = \begin{pmatrix} \mathbf{B}_1 \\ \mathbf{B}_2 \\ \vdots \\ \mathbf{B}_c \end{pmatrix}, \tag{119}$$

where all of the elements of $\mathbf{B}_i$ are zero except for those in the ith column, which
are unity. Then the trace of the “squared” error matrix $(\mathbf{Y}\mathbf{A} - \mathbf{B})^t(\mathbf{Y}\mathbf{A} - \mathbf{B})$ is
minimized by the solution*


$$\mathbf{A} = \mathbf{Y}^\dagger\mathbf{B}, \tag{120}$$


where, as usual, $\mathbf{Y}^\dagger$ is the pseudoinverse of Y.
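
The multiclass MSE solution of Eq. 120 can be sketched with elementary linear algebra. The code below is an illustration under a stated assumption: Y has full column rank, so the pseudoinverse can be obtained from the normal equations (Y^t Y) A = Y^t B. A rank-deficient Y would need a true pseudoinverse routine. The helper names (`transpose`, `matmul`, `solve`, `multiclass_mse`) and the six sample points are made up for the example, and categories are indexed from 0.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

def solve(M, R):
    """Solve M X = R by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[r]) + list(R[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Y^t Y is singular; use a true pseudoinverse")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def multiclass_mse(Y, labels, c):
    """Return (A, B): B is the 0/1 target matrix of Eq. 119, A solves Eq. 120."""
    B = [[1.0 if j == lab else 0.0 for j in range(c)] for lab in labels]
    Yt = transpose(Y)
    A = solve(matmul(Yt, Y), matmul(Yt, B))
    return A, B

# Six augmented two-dimensional samples, two per category (made-up data).
Y = [[1.0, 2.0, 0.0], [1.0, 3.0, 1.0],
     [1.0, 0.0, 2.0], [1.0, 1.0, 3.0],
     [1.0, -2.0, -2.0], [1.0, -3.0, -1.0]]
labels = [0, 0, 1, 1, 2, 2]
A, B = multiclass_mse(Y, labels, c=3)
```

Column i of `A` is the weight vector for the ith discriminant, so a new augmented pattern y would be assigned to the category with the largest entry of the product of y with `A`.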

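This result can be generalized in a theoretically interesting fashion. Let $\lambda_{ij}$ be
the loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$, and let the jth
submatrix of B be given by


$$\mathbf{B}_j = -\left.\begin{pmatrix}
\lambda_{1j} & \lambda_{2j} & \cdots & \lambda_{cj} \\
\lambda_{1j} & \lambda_{2j} & \cdots & \lambda_{cj} \\
\vdots & \vdots & & \vdots \\
\lambda_{1j} & \lambda_{2j} & \cdots & \lambda_{cj}
\end{pmatrix}\right\} n_j \text{ rows}, \qquad j = 1, \ldots, c. \tag{121}$$


Then, as the number of samples approaches infinity, the solution $\mathbf{A} = \mathbf{Y}^\dagger\mathbf{B}$ yields discriminant
functions $\mathbf{a}_i^t\mathbf{y}$ which provide a minimum-mean-square-error approximation
to the Bayes discriminant function


$$g_{0i} = -\sum_{j=1}^{c} \lambda_{ij} P(\omega_j|\mathbf{x}). \tag{122}$$


The proof of this is a direct extension of the proof given in Sect. 5.8.3 (Problem 34).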


Summary 


This chapter considers discriminant functions that are a linear function of a set of
parameters, generally called weights. In all two-category cases, such discriminants
lead to hyperplane decision boundaries, either in the feature space itself, or in a
space where the features have been mapped by a nonlinear function (generalized linear
discriminants).

In broad overview, techniques such as the Perceptron algorithm adjust the parameters
to increase the inner product with patterns in category $\omega_1$ and decrease the inner
product with those in $\omega_2$. A very general approach is to form some criterion function
and perform gradient descent. Different criterion functions have different strengths
and weaknesses related to computation and convergence; none uniformly dominates
the others. For small problems, one can use linear algebra to solve for the weights
(parameters) directly, by means of pseudoinverse matrices.

In Support Vector Machines, the input is mapped by a nonlinear function to a high-dimensional
space, and the optimal hyperplane, the one that has the largest
margin, is found. The support vectors are those (transformed) patterns that determine the
margin; they are informally the hardest patterns to classify, and the most informative
ones for designing the classifier. An upper bound on the expected error rate of the classifier
depends linearly upon the expected number of support vectors.

For multi-category problems, the linear machines create decision boundaries consisting
of sections of such hyperplanes. One can prove convergence of multi-category
algorithms by first converting them to two-category algorithms and using the two-category
proofs. The simplex algorithm finds the optimum of a linear function
subject to (inequality) constraints, and can be used for training linear classifiers.


* If we let $\mathbf{b}_i$ denote the ith column of B, the trace of $(\mathbf{Y}\mathbf{A} - \mathbf{B})^t(\mathbf{Y}\mathbf{A} - \mathbf{B})$ is equal to the sum
of the squared lengths of the error vectors $\mathbf{Y}\mathbf{a}_i - \mathbf{b}_i$. The solution $\mathbf{A} = \mathbf{Y}^\dagger\mathbf{B}$ not only minimizes
this sum, but it also minimizes each term in the sum.

Linear discriminants, while useful, are not sufficiently general for arbitrary challenging
pattern recognition problems (such as those involving multi-modal densities)
unless an appropriate nonlinear mapping ($\varphi$ function) can be found. In this chapter
we have not considered any principled approaches to choosing such functions, but
turn to that topic in Chap. ??.


Bibliographical and Historical Remarks 


Because linear discriminant functions are so amenable to analysis, far more papers 
have been written about them than the subject deserves. Historically, all of this work 
begins with the classic paper by Ronald A. Fisher [4]. The application of linear discriminant
functions to pattern classification was well described in [7], which posed the
problem of the optimal (minimum-risk) linear discriminant, and proposed plausible gradient
descent procedures to determine a solution from samples. Unfortunately, little
can be said about such procedures without knowing the underlying distributions, and 
even then the situation is analytically complex. The design of multicategory classifiers 
using two-category procedures stems from [12]. Minsky and Papert’s Perceptrons 
[11] was influential in pointing out the weaknesses of linear classifiers — weaknesses 
that were overcome by the methods we shall study in Chap. ??. The Winnow algo- 
rithms [8] in the error-free case and [9, 6] and subsequent work in the general case 
have been useful in the computational learning community, as they allow one to derive 
convergence bounds. 

While this work was statistically oriented, many of the pattern recognition papers 
that appeared in the late 1950s and early 1960s adopted other viewpoints. One 
viewpoint was that of neural networks, in which individual neurons were modelled as 
threshold elements, two-category linear machines — work that had its origins in the 
famous paper by McCulloch and Pitts [10]. 

As linear machines have been applied to larger and larger data sets in higher and 
higher dimensions, the computational burden of linear programming [2] has made this 
approach less popular. Stochastic approximations, e.g, [15], 

An early paper on the key ideas in Support Vector Machines is [1]. A more 
extensive treatment, including complexity control, can be found in [14] — material 
we shall visit in Chap. ??. A readable presentation of the method is [3], which provided 
the inspiration behind our Example 2. The Kuhn-Tucker construction, used in the 
SVM training method described in the text and explored in Problem 30, is from [5] 
and used in [13]. The fundamental result is that exactly one of the following three
cases holds: 1) the original (primal) conditions have an optimal solution, in which case
the dual conditions do too, and their objective values are equal; 2) the primal conditions
are infeasible, in which case the dual is either unbounded or itself infeasible; or 3) the
primal conditions are unbounded, in which case the dual is infeasible.


Problems 


Section 5.2


58 CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS 


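1. Consider a linear machine with discriminant functions $g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0}$, i =
1, ..., c. Show that the decision regions are convex by showing that if $\mathbf{x}_1 \in \mathcal{R}_i$ and
$\mathbf{x}_2 \in \mathcal{R}_i$ then $\lambda\mathbf{x}_1 + (1 - \lambda)\mathbf{x}_2 \in \mathcal{R}_i$ if $0 \le \lambda \le 1$.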

2. Figure 5.3 illustrates the two most popular methods for designing a c-category
classifier from linear boundary segments. Another method is to save the full $\binom{c}{2}$
linear $\omega_i/\omega_j$ boundaries, and classify any point by taking a vote based on all these
boundaries. Prove whether the resulting decision regions must be convex. If they need
not be convex, construct a non-pathological example yielding at least one non-convex
decision region.

3. Consider the hyperplane used for discriminant functions. 


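(a) Show that the distance from the hyperplane $g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0 = 0$ to the point
$\mathbf{x}_a$ is $|g(\mathbf{x}_a)|/\|\mathbf{w}\|$ by minimizing $\|\mathbf{x} - \mathbf{x}_a\|^2$ subject to the constraint g(x) = 0.


(b) Show that the projection of $\mathbf{x}_a$ onto the hyperplane is given by


$$\mathbf{x}_p = \mathbf{x}_a - \frac{g(\mathbf{x}_a)}{\|\mathbf{w}\|^2}\,\mathbf{w}.$$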


4. Consider the three-category linear machine with discriminant functions $g_i(\mathbf{x}) =
\mathbf{w}_i^t\mathbf{x} + w_{i0}$, i = 1, 2, 3.


(a) For the special case where x is two-dimensional and the threshold weights $w_{i0}$
are zero, sketch the weight vectors with their tails at the origin, the three lines
joining their heads, and the decision boundaries.


(b) How does this sketch change when a constant vector c is added to each of the 
three weight vectors? 


5. In the multicategory case, a set of samples is said to be linearly separable if there
exists a linear machine that can classify them all correctly. If any samples labelled
$\omega_i$ can be separated from all others by a single hyperplane, we shall say the samples
are totally linearly separable. Show that totally linearly separable samples must be
linearly separable, but that the converse need not be true. (Hint: For the converse,
consider a case in which a linear machine like the one in Problem 4 separates the
samples.)

6. A set of samples is said to be pairwise linearly separable if there exist c(c - 1)/2
hyperplanes $H_{ij}$ such that $H_{ij}$ separates samples labelled $\omega_i$ from samples labelled $\omega_j$. Show
that a pairwise-linearly-separable set of patterns may not be linearly separable.

7. Let $\{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ be a finite set of linearly separable training samples, and let a be
called a solution vector if $\mathbf{a}^t\mathbf{y}_i > 0$ for all i. Show that the minimum-length solution
vector is unique. (Hint: Consider the effect of averaging two solution vectors.)

8. The convex hull of a set of vectors $\mathbf{x}_i$, i = 1, ..., n is the set of all vectors of the
form


$$\mathbf{x} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i,$$


where the coefficients $\alpha_i$ are nonnegative and sum to one. Given two sets of vectors,
show that either they are linearly separable or their convex hulls intersect. (Hint:
Suppose that both statements are true, and consider the classification of a point in
the intersection of the convex hulls.)


9. A classifier is said to be a piecewise linear machine if its discriminant functions
have the form


$$g_i(\mathbf{x}) = \max_{j = 1, \ldots, n_i} g_{ij}(\mathbf{x}),$$


where


$$g_{ij}(\mathbf{x}) = \mathbf{w}_{ij}^t\mathbf{x} + w_{ij0}.$$

(a) Indicate how a piecewise linear machine can be viewed in terms of a linear
machine for classifying subclasses of patterns.


(b) Show that the decision regions of a piecewise linear machine can be nonconvex
and even multiply connected.


(c) Sketch a plot of $g_{ij}(x)$ for a one-dimensional example in which $n_1 = 2$ and
$n_2 = 1$ to illustrate your answer to part (b).


10. Let the d components of x be either 0 or 1. Suppose we assign x to $\omega_1$ if the
number of non-zero components of x is odd, and to $\omega_2$ otherwise. (This is called the
d-bit parity problem.)


(a) Show that this dichotomy is not linearly separable if d > 1.


(b) Show that this problem can be solved by a piecewise linear machine with d + 1
weight vectors $\mathbf{w}_{ij}$ (see Problem 9). (Hint: Consider vectors of the form $\mathbf{w}_{ij} =
\alpha_{ij}(1, 1, \ldots, 1)^t$.)


Section 5.3


11. Consider the quadratic discriminant function (Eq. 4)


$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j,$$


and define the symmetric, nonsingular matrix $\mathbf{W} = [w_{ij}]$. Show that the basic
character of the decision boundary can be described in terms of the scaled matrix
$\bar{\mathbf{W}} = \mathbf{W}/(\mathbf{w}^t\mathbf{W}^{-1}\mathbf{w} - 4w_0)$ as follows:


(a) If $\bar{\mathbf{W}} \propto \mathbf{I}$ (the identity matrix), then the decision boundary is a hypersphere.
(b) If $\bar{\mathbf{W}}$ is positive definite, then the decision boundary is a hyperellipsoid.


(c) If some eigenvalues of $\bar{\mathbf{W}}$ are positive and some negative, then the decision
boundary is a hyperhyperboloid.


(d) Suppose $\mathbf{w} = (5, 2, -3)^t$ and $\mathbf{W} = \begin{pmatrix} 1 & 2 & 0 \\ 2 & 5 & 1 \\ 0 & 1 & 3 \end{pmatrix}$. What is the character of
the solution?

(e) Repeat part (d) for $\mathbf{w} = (2, -1, 3)^t$ and $\mathbf{W} = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 0 & 4 \\ 3 & 4 & -5 \end{pmatrix}$.

12. Derive Eq. 14, where $J(\cdot)$ depends on the iteration step k.
13. Consider the sum-of-squared-error criterion function (Eq. 43),


$$J_s(\mathbf{a}) = \sum_{i=1}^{n} (\mathbf{a}^t\mathbf{y}_i - b_i)^2.$$


Let $b_i = b$ and consider the following six training points:


$\omega_1$: $(1, 5)^t, (2, 9)^t, (-5, -3)^t$
$\omega_2$: $(2, -3)^t, (-1, -4)^t, (0, 2)^t$


(a) Calculate the Hessian matrix for this problem.


(b) Assuming the quadratic criterion function calculate the optimal learning rate $\eta$.

14. In the convergence proof for the Perceptron algorithm (Theorem 5.1) the scale
factor $\alpha$ was taken to be $\beta^2/\gamma$.


(a) Using the notation of Sect. 5.5, show that if $\alpha$ is greater than $\beta^2/(2\gamma)$ the
maximum number of corrections is given by


$$k_0 = \frac{\|\mathbf{a}_1 - \alpha\hat{\mathbf{a}}\|^2}{2\alpha\gamma - \beta^2}.$$


(b) If $\mathbf{a}_1 = \mathbf{0}$, what value of $\alpha$ minimizes $k_0$?


15. Modify the convergence proof given in Sect. 5.5.2 (Theorem 5.1) to prove the
convergence of the following correction procedure: starting with an arbitrary initial
weight vector $\mathbf{a}(1)$, correct $\mathbf{a}(k)$ according to


$$\mathbf{a}(k+1) = \mathbf{a}(k) + \eta(k)\mathbf{y}^k,$$


if and only if $\mathbf{a}^t(k)\mathbf{y}^k$ fails to exceed the margin b, where $\eta(k)$ is bounded by $0 < \eta_a \le
\eta(k) \le \eta_b < \infty$. What happens if b is negative?

16. Let $\{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ be a finite set of linearly separable samples in d dimensions.


(a) Suggest an exhaustive procedure that will find a separating vector in a finite
number of steps. (Hint: Consider weight vectors whose components are integer
valued.)

(b) What is the computational complexity of your procedure?


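17. Consider the criterion function


$$J_q(\mathbf{a}) = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{a})} (\mathbf{a}^t\mathbf{y} - b)^2,$$


where $\mathcal{Y}(\mathbf{a})$ is the set of samples for which $\mathbf{a}^t\mathbf{y} \le b$. Suppose that $\mathbf{y}_1$ is the only
sample in $\mathcal{Y}(\mathbf{a}(k))$. Show that $\nabla J_q(\mathbf{a}(k)) = 2(\mathbf{a}^t(k)\mathbf{y}_1 - b)\mathbf{y}_1$ and that the matrix of
second partial derivatives is given by $\mathbf{D} = 2\mathbf{y}_1\mathbf{y}_1^t$. Use this to show that when the
optimal $\eta(k)$ is used in Eq. ?? the gradient descent algorithm yields


$$\mathbf{a}(k+1) = \mathbf{a}(k) + \frac{b - \mathbf{a}^t(k)\mathbf{y}_1}{\|\mathbf{y}_1\|^2}\,\mathbf{y}_1.$$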
18. Given the conditions in Eqs. 28 - 30, show that $\mathbf{a}(k)$ in the variable increment
descent rule indeed converges for $\mathbf{a}^t\mathbf{y}_i > b$ for all i.

Section 5.6

19. Sketch a figure to illustrate the proof in Sec. 5.6.2. Be sure to take a general
case, and label all variables.


Section 5.7


Section 5.8

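20. Show that the scale factor $\alpha$ in the MSE solution corresponding to Fisher’s linear
discriminant (Sect. ??) is given by


$$\alpha = \left[ 1 + \frac{n_1 n_2}{n}\,(\mathbf{m}_1 - \mathbf{m}_2)^t\,\mathbf{S}_W^{-1}\,(\mathbf{m}_1 - \mathbf{m}_2) \right]^{-1}.$$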


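21. Generalize the results of Sect. 5.8.3 to show that the vector a that minimizes
the criterion function


$$J_s'(\mathbf{a}) = \sum_{\mathbf{y} \in \mathcal{Y}_1} (\mathbf{a}^t\mathbf{y} - (\lambda_{21} - \lambda_{11}))^2 + \sum_{\mathbf{y} \in \mathcal{Y}_2} (\mathbf{a}^t\mathbf{y} + (\lambda_{12} - \lambda_{22}))^2$$


provides asymptotically a minimum-mean-squared-error approximation to the Bayes
discriminant function $(\lambda_{21} - \lambda_{11})P(\omega_1|\mathbf{x}) - (\lambda_{12} - \lambda_{22})P(\omega_2|\mathbf{x})$.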

22. Consider the criterion function $J_m(\mathbf{a}) = \mathcal{E}[(\mathbf{a}^t\mathbf{y}(\mathbf{x}) - z)^2]$ and the Bayes discriminant
function $g_0(\mathbf{x})$.


(a) Show that

$$J_m = \mathcal{E}[(\mathbf{a}^t\mathbf{y} - g_0)^2] - 2\,\mathcal{E}[(\mathbf{a}^t\mathbf{y} - g_0)(z - g_0)] + \mathcal{E}[(z - g_0)^2].$$


(b) Use the fact that the conditional mean of z is $g_0(\mathbf{x})$ in showing that the a that
minimizes $J_m$ also minimizes $\mathcal{E}[(\mathbf{a}^t\mathbf{y} - g_0)^2]$.

23. A scalar analog of the relation $\mathbf{R}_{k+1}^{-1} = \mathbf{R}_k^{-1} + \mathbf{y}_k\mathbf{y}_k^t$ used in stochastic approximation
is $\eta^{-1}(k+1) = \eta^{-1}(k) + y_k^2$.


(a) Show that this has the closed form solution


$$\eta(k) = \frac{\eta(1)}{1 + \eta(1)\sum_{i=1}^{k-1} y_i^2}.$$


(b) Assume that $\eta(1) > 0$ and $0 < a \le y_i^2 \le b < \infty$, and indicate why this sequence
of coefficients will satisfy $\sum \eta(k) \to \infty$ and $\sum \eta(k)^2 \to L < \infty$.


24. Show that for the Widrow-Hoff or LMS rule, if $\eta(k) = \eta(1)/k$ then the
sequence of weight vectors converges to a limiting vector a satisfying $\mathbf{Y}^t(\mathbf{Y}\mathbf{a} - \mathbf{b}) = 0$
(Eq. 61).
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25. Consider the following six data points: 


$\omega_1$: $(1, 2)^t, (2, -4)^t, (-3, -1)^t$
$\omega_2$: $(2, 4)^t, (-1, -5)^t, (5, 0)^t$


(a) Are they linearly separable? 


(b) Using the approach in the text, assume R = I, the identity matrix, and calculate
the optimal learning rate $\eta$ by Eq. 85.

26. The linear programming problem formulated in Sect. 5.10.2 involved minimizing
a single artificial variable $\tau$ under the constraints $\mathbf{a}^t\mathbf{y}_i + \tau \ge b_i$ and $\tau \ge 0$. Show that
the resulting weight vector minimizes the criterion function


$$J_\tau(\mathbf{a}) = \max_{\mathbf{a}^t\mathbf{y}_i \le b_i} \left[\, b_i - \mathbf{a}^t\mathbf{y}_i \,\right].$$


Section 5.11


27. Discuss qualitatively why if samples from two categories are distinct (i.e., no 
feature point is labelled by both categories), there always exists a nonlinear mapping 
to a higher dimension that leaves the points linearly separable. 

28. The figure in Example 2 shows the maximum margin for a Support Vector
Machine applied to the exclusive-OR problem mapped to a five-dimensional space.
That figure shows the training patterns and contours of the discriminant function, as
projected in the two-dimensional subspace defined by the features $\sqrt{2}x_1$ and $\sqrt{2}x_1x_2$.
Ignore the constant feature, and consider the other four features. For each of the
$\binom{4}{2} - 1 = 5$ pairs of features other than the one shown in the Example, show the
patterns and the lines corresponding to the discriminant $g = \pm 1$. Are the margins in
your figures the same? Explain why or why not.

29. Consider a Support Vector Machine and the following training data from two
categories:


category   | $x_1$ | $x_2$
$\omega_1$ |   1   |   1
$\omega_1$ |   2   |   2
$\omega_1$ |   2   |   0
$\omega_2$ |   0   |   0
$\omega_2$ |   1   |   0
$\omega_2$ |   0   |   1


(a) Plot these six training points, and construct by inspection the weight vector for
the optimal hyperplane, and the optimal margin.


(b) What are the support vectors?


(c) Construct the solution in the dual space by finding the Lagrange undetermined
multipliers, $\alpha_i$. Compare your result to that in part (a).


30. This problem asks you to follow the Kuhn-Tucker theorem to convert the constrained
optimization problem in Support Vector Machines into a dual, unconstrained
one. For SVMs, the goal is to find the minimum length weight vector a subject to
the (classification) constraints


$$z_k\,\mathbf{a}^t\mathbf{y}_k \ge 1, \qquad k = 1, \ldots, n,$$


where $z_k = \pm 1$ indicates the target category of each of the n patterns $\mathbf{y}_k$. Note that
a and y are augmented (by $a_0$ and $y_0 = 1$, respectively).


(a) Consider the unconstrained optimization associated with SVMs:


$$L(\mathbf{a}, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{a}\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k\,\mathbf{a}^t\mathbf{y}_k - 1 \right].$$


In the space determined by the components of a, and the n (scalar) undetermined
multipliers $\alpha_k$, the desired solution is a saddle point, rather than a global
maximum or minimum. Explain.


(b) Next eliminate the dependency of this (“primal”) functional upon a, i.e., reformulate 
the optimization in a dual form, by the following steps. Note that at 
the saddle point of the primal functional, we have 
(c) Prove that at this inflection point, the optimal hyperplane is a linear combination 
of the training vectors: 

$$\mathbf{a}^* = \sum_{k=1}^{n} \alpha_k^* z_k \mathbf{y}_k.$$


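(d) According to the Kuhn-Tucker theorem (cf. Bibliography), an undetermined 
multiplier $\alpha_k^*$ is non-zero only if the corresponding sample $\mathbf{y}_k$ satisfies 
$z_k \mathbf{a}^t \mathbf{y}_k - 1 = 0$. Show that this can be expressed as 

$$\alpha_k^* \left[ z_k \mathbf{a}^{*t} \mathbf{y}_k - 1 \right] = 0, \qquad k = 1, \ldots, n.$$

(The samples where $\alpha_k^*$ are nonzero, i.e., $z_k \mathbf{a}^t \mathbf{y}_k = 1$, are the support vectors.) 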


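(e) Use the results from parts (b) - (c) to eliminate the weight vector in the 
functional, and thereby construct the dual functional 

$$\tilde{L}(\mathbf{a}, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{a}\|^2 - \sum_{k=1}^{n} \alpha_k z_k \mathbf{a}^t \mathbf{y}_k + \sum_{k=1}^{n} \alpha_k.$$

(f) Substitute the solution $\mathbf{a}^*$ from part (c) to find the dual functional 

$$\tilde{L}(\boldsymbol{\alpha}) = -\frac{1}{2} \sum_{j,k=1}^{n} \alpha_j \alpha_k z_j z_k \left( \mathbf{y}_j^t \mathbf{y}_k \right) + \sum_{j=1}^{n} \alpha_j.$$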


31. Repeat Example 2. 
Section 5.12 


32. Suppose that for each two-dimensional training point $\mathbf{y}_i$ in category $\omega_1$ there is 
a corresponding (symmetric) point in $\omega_2$ at $-\mathbf{y}_i$. 


(a) Prove that a separating hyperplane (should one exist) or LMS solution must go 
through the origin. 


(b) Consider such a symmetric, six-point problem with the following points: 


Find the mathematical conditions on y such that the LMS solution for this 
problem does not give a separating hyperplane. 


(c) Generalize your result as follows. Suppose $\omega_1$ consists of $\mathbf{y}_1$ and $\mathbf{y}_2$ (known) 
and the symmetric versions in $\omega_2$. What is the condition on $\mathbf{y}_3$ such that the 
LMS solution does not separate the points? 

33. Write pseudocode for a fixed increment multicategory algorithm based on 
Eq. 115. Discuss the strengths and weaknesses of such an implementation. 

34. Generalize the discussion in Sect. 5.8.3 in order to prove that the solution 
derived from Eq. 120 provides a minimum-mean-square-error approximation to the 
Bayes discriminant function given in Eq. 122. 


Computer exercises 


Several of the exercises use the data in the following table. 

       |  $\omega_1$   |  $\omega_2$   |  $\omega_3$   |  $\omega_4$ 
sample | $x_1$  $x_2$  | $x_1$  $x_2$  | $x_1$  $x_2$  | $x_1$  $x_2$ 
   1   |  0.1    1.1   |  7.1    4.2   | -3.0   -2.9   | -2.0   -8.4 
   2   |  6.8    7.1   | -1.4   -4.3   |  0.5    8.7   | -8.9    0.2 
   3   | -3.5   -4.1   |  4.5    0.0   |  2.9    2.1   | -4.2   -7.7 
   4   |  2.0    2.7   |  6.3    1.6   | -0.1    5.2   | -8.5   -3.2 
   5   |  4.1    2.8   |  4.2    1.9   | -4.0    2.2   | -6.7   -4.0 
   6   |  3.1    5.0   |  1.4   -3.2   | -1.3    3.7   | -0.5   -9.2 
   7   | -0.8   -1.3   |  2.4   -4.0   | -3.4    6.2   | -5.3   -6.7 
   8   |  0.9    1.2   |  2.5   -6.1   | -4.1    3.4   | -8.7   -6.4 
   9   |  5.0    6.4   |  8.4    3.7   | -5.1    1.6   | -7.1   -9.7 
  10   |  3.9    4.0   |  4.1   -2.2   |  1.9    5.1   | -8.0   -6.3 


1. Consider basic gradient descent (Algorithm 1) and Newton’s algorithm (Algo- 
rithm 2) applied to the data in the tables. 


(a) Apply both to the three-dimensional data in categories $\omega_1$ and $\omega_3$. For the 
gradient descent use $\eta(k) = 0.1$. Plot the criterion function as a function of the 
iteration number. 


(b) Estimate the total number of mathematical operations in the two algorithms. 


(c) Plot the convergence time versus learning rate. What is the minimum learning 
rate that fails to lead to convergence? 

2. Write a program to implement the Perceptron algorithm. 


(a) Starting with a = 0, apply your program to the training data from $\omega_1$ and $\omega_2$. 
Note the number of iterations required for convergence. 


(b) Apply your program to $\omega_3$ and $\omega_2$. Again, note the number of iterations required 
for convergence. 


(c) Explain the difference between the iterations required in the two cases. 


3. The Pocket algorithm uses the criterion of longest sequence of correctly classified 
points, and can be used in conjunction with a number of basic learning algorithms. For 
instance, one can use the Pocket algorithm in conjunction with the Perceptron algorithm 
in a sort of ratchet scheme as follows. There are two sets of weights, one for the normal 
Perceptron algorithm, and a separate one (not directly used for training) which is kept 
“in your pocket.” Both are randomly chosen at the start. The “pocket” weights are 
tested on the full data set to find the longest run of patterns properly classified. (At 
the beginning, this run will be short.) The Perceptron weights are trained as usual, 
but after every weight update (or after some finite number of such weight updates), 
the Perceptron weights are tested on data points, randomly selected, to determine the 
longest run of properly classified points. If this length is greater than that of the pocket 
weights, the Perceptron weights replace the pocket weights, and Perceptron training 
continues. In this way, the pocket weights continually improve, classifying longer and 
longer runs of randomly selected points. 


(a) Write a pocket algorithm to be employed with the Perceptron algorithm. 


(b) Apply it to the data from $\omega_1$ and $\omega_3$. How often are the pocket weights updated? 


4. Start with a randomly chosen a. Calculate $\beta^2$ (Eq. 21). At the end of training 
calculate $\gamma$ (Eq. 22). Verify $k_0$ (Eq. 25). 

5. Show that the first xx points of categories wy and wxg. Construct by hand 
a nonlinear mapping of the feature space to make them linearly separable. Train a 
Perceptron classifier on them. 

6. Consider a version of the Balanced Winnow training algorithm (Algorithm 7). 
Classification of test data is given by line 2. Compare the convergence rate of Balanced 
Winnow with the fixed-increment, single-sample Perceptron (Algorithm 4) on a 
problem with a large number of redundant features, as follows. 


(a) Generate a training set of 2000 100-dimensional patterns (1000 from each of two 
categories) in which only the first ten features are informative, in the following 
way. For patterns in category $\omega_1$, each of the first ten features is chosen 
randomly and uniformly from the range $+1 < x_i < +2$, for $i = 1, \ldots, 10$. Conversely, 
for patterns in $\omega_2$, each of the first ten features is chosen randomly and 
uniformly from the range $-2 < x_i < -1$. All other features from both categories 
are chosen from the range $-2 < x_i < +2$. 


(b) Construct by hand the obvious separating hyperplane. 


(c) Adjust the learning rates so that your two algorithms have roughly the same 
convergence rate on the full training set when only the first ten features are 
considered. That is, assume each of the 2000 training patterns consists of just 
the first ten features. 


(d) Now apply your two algorithms to 2000 50-dimensional patterns, in which the 
first ten features are informative and the remaining 40 are not. Plot the total 
number of errors versus iteration. 


(e) Now apply your two algorithms to the full training set of 2000 100-dimensional 
patterns. 


(f) Summarize your answers to parts (c) - (e). 

7. Consider relaxation methods. 


(a) Implement batch relaxation with margin (Algorithm 8), set b = 0.1 and a(1) = 0 
and apply it to the data in $\omega_1$ and $\omega_3$. Plot the criterion function as a function 
of the number of passes through the training set. 


(b) Repeat for b = 0.5 and a(1) = 0. Explain qualitatively any difference you find 
in the convergence rates. 


(c) Modify your program to use single sample learning. Again, plot the criterion as 
a function of the number of passes through the training set. 


(d) Discuss any differences, being sure to consider the learning rate. 


Section 5.8 

8. Write a single-sample relaxation algorithm and use Eq. ?? for updating R. Apply 
your program to the data in $\omega_2$ and $\omega_3$. 

Section 5.9 

9. Implement the Ho-Kashyap algorithm (Algorithm 11) and apply it to the data in 
categories $\omega_1$ and $\omega_3$. Repeat for categories $\omega_4$ and $\omega_2$. 

Section 5.10 

10. example where the LMS rule need not give the separating vector, even if one 
exists 


Section 5.11 


11. Support Vector Machine xxx. Apply it to the classification of w3 and w4. 
Section 5.12 


12. Write a program to implement a multicategory generalization of basic 
single-sample relaxation without margin (Algorithm ??). 


(a) Apply it to the data in all four categories in the table. 


(b) Use your algorithm in a two-category mode to form $\omega_i$/not $\omega_i$ boundaries for 
$i = 1, 2, 3, 4$. Find any regions whose categorization by your system is ambiguous. 


Chapter 6 


Multilayer Neural Networks 


6.1 Introduction 


In the previous chapter we saw a number of methods for training classifiers 
consisting of input units connected by modifiable weights to output units. The LMS 
algorithm, in particular, provided a powerful gradient descent method for reducing 
the error, even when the patterns are not linearly separable. Unfortunately, the class 
of solutions that can be obtained from such networks (hyperplane discriminants), 
while surprisingly good on a range of real-world problems, is simply not general 
enough in demanding applications: there are many problems for which linear 
discriminants are insufficient for minimum error. 

With a clever choice of nonlinear $\varphi$ functions, however, we can obtain arbitrary 
decisions, in particular the one leading to minimum error. The central difficulty is, 
naturally, choosing the appropriate nonlinear functions. One brute force approach 
might be to choose a complete basis set (all polynomials, say) but this will not work; 
such a classifier would have too many free parameters to be determined from a limited 
number of training patterns (Chap. ??). Alternatively, we may have prior knowledge 
relevant to the classification problem and this might guide our choice of nonlinearity. 
In the absence of such information, up to now we have seen no principled or auto- 
matic method for finding the nonlinearities. What we seek, then, is a way to learn 
the nonlinearity at the same time as the linear discriminant. This is the approach 
of multilayer neural networks (also called multilayer Perceptrons): the parameters 
governing the nonlinear mapping are learned at the same time as those governing the 
linear discriminant. 

We shall revisit the limitations of the two-layer networks of the previous chapter,* 
and see how three-layer (and four-layer...) nets overcome those drawbacks — indeed 
how such multilayer networks can, at least in principle, provide the optimal solution 
to an arbitrary classification problem. There is nothing particularly magical about 
multilayer neural networks; at base they implement linear discriminants, but in a space 
where the inputs have been mapped nonlinearly. The key power provided by such 
networks is that they admit fairly simple algorithms where the form of the nonlinearity 
can be learned from training data. The models are thus extremely powerful, have nice 
theoretical properties, and apply well to a vast array of real-world applications. 


* Some authors describe such networks as single layer networks because they have only one layer of 
modifiable weights, but we shall instead refer to them based on the number of layers of units. 


One of the most popular methods for training such multilayer networks is based 
on gradient descent in error: the backpropagation algorithm (or generalized delta 
rule), a natural extension of the LMS algorithm. We shall study backpropagation 
in depth, first of all because it is powerful, useful and relatively easy to understand, 
but also because many other training methods can be seen as modifications of it. 
The backpropagation training method is simple even for complex models (networks) 
having hundreds or thousands of parameters. In part because of the intuitive graphical 
representation and the simplicity of design of these models, practitioners can test 
different models quickly and easily; neural networks are thus a sort of “poor person’s” 
technique for doing statistical pattern recognition with complicated models. The 
conceptual and algorithmic simplicity of backpropagation, along with its manifest 
success on many real-world problems, helps to explain why it is a mainstay in adaptive 
pattern recognition. 

While the basic theory of backpropagation is simple, a number of tricks — some 
a bit subtle — are often used to improve performance and increase training speed. 
Choices involving the scaling of input values and initial weights, desired output values, 
and more can be made based on an analysis of networks and their function. We shall 
also discuss alternate training schemes, for instance ones that are faster, or adjust 
their complexity automatically in response to training data. 

Network architecture or topology plays an important role for neural net classifi- 
cation, and the optimal topology will depend upon the problem at hand. It is here 
that another great benefit of networks becomes apparent: often knowledge of the 
problem domain which might be of an informal or heuristic nature can be easily in- 
corporated into network architectures through choices in the number of hidden layers, 
units, feedback connections, and so on. Thus setting the topology of the network is 
heuristic model selection. The practical ease in selecting models (network topologies) 
and estimating parameters (training via backpropagation) enables classifier designers 
to try out alternate models fairly simply. 

A deep problem in the use of neural network techniques involves regularization, 
complexity adjustment, or model selection, that is, selecting (or adjusting) the com- 
plexity of the network. Whereas the number of inputs and outputs is given by the 
feature space and number of categories, the total number of weights or parameters in 
the network is not — or at least not directly. If too many free parameters are used, 
generalization will be poor; conversely if too few parameters are used, the training 
data cannot be learned adequately. How shall we adjust the complexity to achieve 
the best generalization? We shall explore a number of methods for complexity ad- 
justment, and return in Chap. ?? to their theoretical foundations. 

It is crucial to remember that neural networks do not exempt designers from 
intimate knowledge of the data and problem domain. Networks provide a powerful 
and speedy tool for building classifiers, and as with any tool or technique one gains 
intuition and expertise through analysis and repeated experimentation over a broad 
range of problems. 


6.2 Feedforward operation and classification 


Figure 6.1 shows a simple three-layer neural network. This one consists of an input 
layer (having two input units), a hidden layer (with two hidden units)* and an output 
layer (a single unit), interconnected by modifiable weights, represented by links 
between layers. There is, furthermore, a single bias unit that is connected to each unit 
other than the input units. The function of units is loosely based on properties of 
biological neurons, and hence they are sometimes called “neurons.” We are interested in 
the use of such networks for pattern recognition, where the input units represent the 
components of a feature vector (to be learned or to be classified) and signals emitted 
by output units will be discriminant functions used for classification. 


* We call any units that are neither input nor output units “hidden” because their activations are 
not directly “seen” by the external environment, i.e., the input or output. 


We can clarify our notation and describe the feedforward (or classification or recall) 
operation of such a network on what is perhaps the simplest nonlinear problem: the 
exclusive-OR (XOR) problem (Fig. 6.1); a three-layer network can indeed solve this 
problem whereas a linear machine operating directly on the features cannot. 

Each two-dimensional input vector is presented to the input layer, and the output 
of each input unit equals the corresponding component in the vector. Each hidden 
unit performs the weighted sum of its inputs to form its (scalar) net activation or 
simply net. That is, the net activation is the inner product of the inputs with the 
weights at the hidden unit. For simplicity, we augment both the input vector (i.e., 
append a feature value $x_0 = 1$) and the weight vector (i.e., append a value $w_0$), and 
can then write 


$$\mathrm{net}_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv \mathbf{w}_j^t \mathbf{x}, \qquad (1)$$

where the subscript $i$ indexes units on the input layer, $j$ for the hidden; $w_{ji}$ denotes 
the input-to-hidden layer weights at the hidden unit $j$. In analogy with neurobiology, 
such weights or connections are sometimes called “synapses” and the value of 
the connection the “synaptic weights.” Each hidden unit emits an output that is a 
nonlinear function of its activation, $f(\mathrm{net})$, i.e., 


$$y_j = f(\mathrm{net}_j). \qquad (2)$$


The example shows a simple threshold or sign (read “signum”) function, 


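$$f(\mathrm{net}) = \mathrm{Sgn}(\mathrm{net}) \equiv \begin{cases} 1 & \text{if } \mathrm{net} \geq 0 \\ -1 & \text{if } \mathrm{net} < 0, \end{cases} \qquad (3)$$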


but as we shall see, other functions have more desirable properties and are hence 
more commonly used. This f() is sometimes called the transfer function or merely 
“nonlinearity” of a unit, and serves as a $\varphi$ function discussed in Chap. ??. We have 
assumed the same nonlinearity is used at the various hidden and output units, though 
this is not crucial. 

Each output unit similarly computes its net activation based on the hidden unit 
signals as 


$$\mathrm{net}_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y}, \qquad (4)$$


where the subscript $k$ indexes units in the output layer (one, in the figure) and $n_H$ 
denotes the number of hidden units (two, in the figure). We have mathematically 
treated the bias unit as equivalent to one of the hidden units whose output is always 
$y_0 = 1$. Each output unit then computes the nonlinear function of its net, emitting 


$$z_k = f(\mathrm{net}_k), \qquad (5)$$


where in the figure we assume that this nonlinearity is also a sign function. It is these 
final output signals that represent the different discriminant functions. We would 
typically have c such output units and the classification decision is to label the input 
pattern with the label corresponding to the maximum $z_k = g_k(\mathbf{x})$. In a two-category 
case such as XOR, it is traditional to use a single output unit and label a pattern by 
the sign of the output z. 

Figure 6.1: The two-bit parity or exclusive-OR problem can be solved by a three-layer 
network. At the bottom is the two-dimensional feature space $x_1$-$x_2$, and the four 
patterns to be classified. The three-layer network is shown in the middle. The input 
units are linear and merely distribute their (feature) values through multiplicative 
weights to the hidden units. The hidden and output units here are linear threshold 
units, each of which forms the linear sum of its inputs times their associated weight, 
and emits a +1 if this sum is greater than or equal to 0, and —1 otherwise, as shown 
by the graphs. Positive (“excitatory”) weights are denoted by solid lines, negative 
(“inhibitory”) weights by dashed lines; the weight magnitude is indicated by the 
relative thickness, and is labeled. The single output unit sums the weighted signals 
from the hidden units (and bias) and emits a +1 if that sum is greater than or equal 
to 0 and a -1 otherwise. Within each unit we show a graph of its input-output or 
transfer function — f(net) vs. net. This function is linear for the input units, a 
constant for the bias, and a step or sign function elsewhere. We say that this network 
has a 2-2-1 fully connected topology, describing the number of units (other than the 
bias) in successive layers. 


It is easy to verify that the three-layer network with the weight values listed indeed 
solves the XOR problem. The hidden unit computing $y_1$ acts like a Perceptron, and 
computes the boundary $x_1 + x_2 + 0.5 = 0$; input vectors for which $x_1 + x_2 + 0.5 \geq 0$ 
lead to $y_1 = 1$, all other inputs lead to $y_1 = -1$. Likewise the other hidden unit 
computes the boundary $x_1 + x_2 - 1.5 = 0$. The final output unit emits $z_1 = +1$ if and 
only if both $y_1$ and $y_2$ have value +1. This gives rise to the appropriate nonlinear decision 
region shown in the figure; the XOR problem is solved. 
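
The feedforward computation of Eqs. 1-5 can be checked numerically for this example. The following Python sketch is illustrative only. The weight values of the figure are not reproduced in the text, so the weights below are assumptions: they realize the two stated hidden-unit boundaries, orient the second hidden unit so that $y_2 = +1$ on the side $x_1 + x_2 - 1.5 \leq 0$, and use hypothetical hidden-to-output weights that make the output +1 only when both hidden units emit +1. The four patterns are assumed to sit at $(\pm 1, \pm 1)$.

```python
import numpy as np

def sgn(net):
    """Sign function of Eq. 3: +1 if net >= 0, otherwise -1."""
    return np.where(net >= 0, 1, -1)

# Assumed weights (not taken from the figure); column 0 multiplies the
# constant bias signal x0 = 1 or y0 = 1.
W_HIDDEN = np.array([[0.5, 1.0, 1.0],     # y1 = +1 where x1 + x2 + 0.5 >= 0
                     [1.5, -1.0, -1.0]])  # y2 = +1 where x1 + x2 - 1.5 <= 0
W_OUTPUT = np.array([-1.5, 1.0, 1.0])     # z = +1 only if y1 = y2 = +1

def xor_net(x1, x2):
    """Feedforward pass of the 2-2-1 network for one pattern."""
    x = np.array([1.0, x1, x2])           # augmented input vector
    y = sgn(W_HIDDEN @ x)                 # hidden unit outputs, Eqs. 1-2
    y_aug = np.concatenate(([1.0], y))    # augmented hidden vector
    return int(sgn(W_OUTPUT @ y_aug))     # output z, Eqs. 4-5

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))
```

With these assumed weights the network emits +1 for the two patterns whose components differ and -1 for the two whose components agree, which is the XOR labeling.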


6.2.1 General feedforward operation 


From the above example, it should be clear that nonlinear multilayer networks (i.e., 
ones with input units, hidden units and output units) have greater computational or 
expressive power than similar networks that otherwise lack hidden units; that is, they 
can implement more functions. Indeed, we shall see in Sect. 6.2.2 that given a sufficient 
number of hidden units of a general type any function can be so represented. 

Clearly, we can generalize the above discussion to more inputs, other nonlinearities, 
and an arbitrary number of output units. For classification, we will have c output 
units, one for each of the categories, and the signal from each output unit is the 
discriminant function $g_k(\mathbf{x})$. We gather the results from Eqs. 1, 2, 4, & 5, to express 
such discriminant functions as: 


$$g_k(\mathbf{x}) \equiv z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right). \qquad (6)$$


This, then, is the class of functions that can be implemented by a three-layer neural 
network. An even broader generalization would allow transfer functions at the output 
layer to differ from those in the hidden layer, or indeed even different functions at 
each individual unit. We will have cause to use such networks later, but the attendant 
notational complexities would cloud our presentation of the key ideas in learning in 
networks. 
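
Eq. 6 can be evaluated with two matrix-vector products. The following Python sketch is illustrative only: the function names, the storage of the bias weights in column 0 of each weight matrix, and the choice of tanh as the transfer function f are assumptions made for the example rather than notation from the text.

```python
import numpy as np

def feedforward(x, W_hidden, W_output, f=np.tanh):
    """Evaluate the discriminants g_k(x) of Eq. 6 for one pattern x of length d.

    W_hidden has shape (n_H, d + 1) and W_output has shape (c, n_H + 1).
    Column 0 of each matrix holds the bias weights w_j0 and w_k0.
    """
    x_aug = np.concatenate(([1.0], x))   # x0 = 1
    y = f(W_hidden @ x_aug)              # y_j = f(net_j), Eqs. 1-2
    y_aug = np.concatenate(([1.0], y))   # y0 = 1
    return f(W_output @ y_aug)           # z_k = f(net_k), Eqs. 4-5

def classify(x, W_hidden, W_output):
    """Return the index k of the largest discriminant g_k(x)."""
    return int(np.argmax(feedforward(x, W_hidden, W_output)))

# Usage with arbitrary (untrained) weights for a 3-4-2 network.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 3 + 1))
W_o = rng.normal(size=(2, 4 + 1))
x = np.array([0.2, -0.7, 1.5])
print(feedforward(x, W_h, W_o), classify(x, W_h, W_o))
```

Passing a different callable as `f` changes the nonlinearity at every unit; using different functions at the hidden and output layers would require one extra argument.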


6.2.2 Expressive power of multilayer networks 


It is natural to ask if every decision can be implemented by such a three-layer network 
(Eq. 6). The answer, due ultimately to Kolmogorov but refined by others, is “yes”: 
any continuous function from input to output can be implemented in a three-layer 
net, given a sufficient number of hidden units $n_H$, proper nonlinearities, and weights. 
In particular, any posterior probabilities can be represented. In the c-category 
classification case, we can merely apply a max[·] function to the set of network outputs 
(just as we saw in Chap. ??) and thereby obtain any decision boundary. 

Specifically, Kolmogorov proved that any continuous function $g(\mathbf{x})$ defined on the 
unit hypercube $I^n$ ($I = [0, 1]$ and $n \geq 2$) can be represented in the form 


$$g(\mathbf{x}) = \sum_{j=1}^{2n+1} \Xi_j \left( \sum_{i=1}^{d} \psi_{ij}(x_i) \right) \qquad (7)$$


for properly chosen functions $\Xi_j$ and $\psi_{ij}$. We can always scale the input region of 
interest to lie in a hypercube, and thus this condition on the feature space is not 
limiting. Equation 7 can be expressed in neural network terminology as follows: each 
of $2n + 1$ hidden units takes as input a sum of $d$ nonlinear functions, one for each 
input feature $x_i$. Each hidden unit emits a nonlinear function $\Xi$ of its total input; the 
output unit merely emits the sum of the contributions of the hidden units. 


Unfortunately, the relationship of Kolmogorov’s theorem to practical neural net- 
works is a bit tenuous, for several reasons. In particular, the functions $\Xi_j$ and $\psi_{ij}$ 
are not the simple weighted sums passed through nonlinearities favored in neural net- 
works. In fact those functions can be extremely complex; they are not smooth, and 
indeed for subtle mathematical reasons they cannot be smooth. As we shall soon 
see, smoothness is important for gradient descent learning. Most importantly, Kol- 
mogorov’s Theorem tells us very little about how to find the nonlinear functions based 
on data — the central problem in network based pattern recognition. 


A more intuitive proof of the universal expressive power of three-layer nets is in- 
spired by Fourier’s Theorem that any continuous function g(x) can be approximated 
arbitrarily closely by a (possibly infinite) sum of harmonic functions (Problem 2). One 
can imagine networks whose hidden units implement such harmonic functions. Proper 
hidden-to-output weights related to the coefficients in a Fourier synthesis would then 
enable the full network to implement the desired function. Informally speaking, we 
need not build up harmonic functions for Fourier-like synthesis of a desired function. 
Instead a sufficiently large number of “bumps” at different input locations, of different 
amplitude and sign, can be put together to give our desired function. Such localized 
bumps might be implemented in a number of ways, for instance by sigmoidal transfer 
functions grouped appropriately (Fig. 6.2). The Fourier analogy and bump 
constructions are conceptual tools; they do not explain the way networks in fact function. In 
short, this is not how neural networks “work”: we never find that through training 
(Sect. 6.3) simple networks build a Fourier-like representation, or learn to group 
sigmoids to get component bumps. 


Figure 6.2: A 2-4-1 network (with bias) along with the response functions at different 
units; each hidden and output unit has sigmoidal transfer function f(·). In the case 
shown, the hidden unit outputs are paired in opposition thereby producing a “bump” 
at the output unit. Given a sufficiently large number of hidden units, any continuous 
function from input to output can be approximated arbitrarily well by such a network. 
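
The pairing of sigmoids in opposition can be sketched in one dimension. The Python fragment below is illustrative only: the logistic form of the sigmoid, the parameter names `center`, `half_width` and `steepness`, and the omission of the output unit's own nonlinearity are all assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(net):
    """Logistic sigmoid, used here as an assumed transfer function."""
    return 1.0 / (1.0 + np.exp(-net))

def bump(x, center=0.0, half_width=1.0, steepness=10.0):
    """Difference of two shifted sigmoids.

    The first hidden unit switches on at center - half_width and the
    second at center + half_width; subtracting the second from the first
    leaves a localized bump between the two switching points.
    """
    rising = sigmoid(steepness * (x - (center - half_width)))
    falling = sigmoid(steepness * (x - (center + half_width)))
    return rising - falling

xs = np.linspace(-5.0, 5.0, 11)
print(np.round(bump(xs), 3))
```

Summing several such bumps with different centers, widths and signed amplitudes is the informal construction described above for assembling a desired function.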

While we can be confident that a complete set of functions, such as all polynomials, 
can represent any function, it is nevertheless a fact that a single functional form 
also suffices, so long as each component has appropriate variable parameters. In the 
absence of information suggesting otherwise, we generally use a single functional form 
for the transfer functions. 

While these latter constructions show that any desired function can be imple- 
mented by a three-layer network, they are not particularly practical because for most 
problems we know ahead of time neither the number of hidden units required, nor 
the proper weight values. Even if there were a constructive proof, it would be of little 
use in pattern recognition since we do not know the desired function anyway — it 
is related to the training patterns in a very complicated way. All in all, then, these 
results on the expressive power of networks give us confidence we are on the right 
track, but shed little practical light on the problems of designing and training neural 
networks — their main benefit for pattern recognition (Fig. 6.3). 

Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision 
boundary, given an adequate number of hidden units, three-, four- and higher-layer 
networks can implement arbitrary decision boundaries. The decision regions need not 
be convex, nor simply connected. 


6.3 Backpropagation algorithm 


We have just seen that any function from input to output can be implemented as a 
three-layer neural network. We now turn to the crucial problem of setting the weights 
based on training patterns and desired output. 

Backpropagation is one of the simplest and most general methods for supervised 
training of multilayer neural networks — it is the natural extension of the LMS al- 
gorithm for linear systems we saw in Chap. ??. Other methods may be faster or 
have other desirable properties, but few are more instructive. The LMS algorithm 
worked for two-layer systems because we had an error (proportional to the square of 
the difference between the actual output and the desired output) evaluated at the 
output unit. Similarly, in a three-layer net it is a straightforward matter to find how 
the output (and thus error) depends on the hidden-to-output layer weights. In fact 
this dependency is the same as in the analogous two-layer case, and thus the learning 
rule is the same. 

But how should the input-to-hidden weights be learned, the ones governing the 
nonlinear transformation of the input vectors? If the “proper” outputs for a hidden 
unit were known for any pattern, the input-to-hidden weights could be adjusted to 
approximate it. However, there is no explicit teacher to state what the hidden unit’s 
output should be. This is called the credit assignment problem. The power of back- 
propagation is that it allows us to calculate an effective error for each hidden unit, 
and thus derive a learning rule for the input-to-hidden weights. 

Networks have two primary modes of operation: feedforward and learning. Feed- 
forward operation, such as illustrated in our XOR example above, consists of present- 
ing a pattern to the input units and passing the signals through the network in order 
to yield outputs from the output units. Supervised learning consists of presenting 
an input pattern as well as a desired, teaching or target pattern to the output layer 
and changing the network parameters (e.g., weights) in order to make the actual out- 
put more similar to the target one. Figure 6.4 shows a three-layer network and the 
notation we shall use. 


6.3.1 Network learning 


The basic approach in learning is to start with an untrained network, present an input 
training pattern and determine the output. The error or criterion function is some 
scalar function of the weights that is minimized when the network outputs match the 
desired outputs. The weights are adjusted to reduce this measure of error. Here we 
present the learning rule on a per pattern basis, and return to other protocols later. 

We consider the training error on a pattern to be the sum over output units of the 
squared difference between the desired output $t_k$ (given by a teacher) and the actual 
output $z_k$, much as we had in the LMS algorithm for two-layer nets: 


$$J(\mathbf{w}) \equiv \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2, \qquad (8)$$


where t and z are the target and the network output vectors of length c; w represents 
all the weights in the network. 
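
As a small numerical illustration of Eq. 8 (the values are hypothetical, chosen only to make the arithmetic easy to follow), take $c = 2$ output units with target $\mathbf{t} = (1, -1)^t$ and network output $\mathbf{z} = (0.8, -0.4)^t$:

```latex
J(\mathbf{w}) = \tfrac{1}{2}\left[ (1 - 0.8)^2 + \bigl(-1 - (-0.4)\bigr)^2 \right]
             = \tfrac{1}{2}\left[ 0.04 + 0.36 \right]
             = 0.2
```

The second output unit, which is further from its target, contributes nine times as much to the error as the first.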

Figure 6.4: A $d$-$n_H$-$c$ fully connected three-layer network and the notation we shall use
(bias not shown). During feedforward operation, a $d$-dimensional input pattern $\mathbf{x}$ is
presented to the input layer; each input unit then emits its corresponding component
$x_i$. Each of the $n_H$ hidden units computes its net activation, $net_j$, as the inner
product of the input layer signals with weights $w_{ji}$ at the hidden unit. The hidden
unit emits $y_j = f(net_j)$, where $f(\cdot)$ is the nonlinear transfer function, shown here as
a sigmoid. Each of the $c$ output units functions in the same manner as the hidden
units do, computing $net_k$ as the inner product of the hidden unit signals and weights
at the output unit. The final signals emitted by the network, $z_k = f(net_k)$, are used
as discriminant functions for classification. During network training, these output
signals are compared with a teaching or target vector $\mathbf{t}$, and any difference is used in
training the weights throughout the network.

The backpropagation learning rule is based on gradient descent. The weights are
initialized with random values, and are changed in a direction that will reduce the
error:

$$\Delta\mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}}, \qquad (9)$$

or in component form

$$\Delta w_{mn} = -\eta \frac{\partial J}{\partial w_{mn}}, \qquad (10)$$


where $\eta$ is the learning rate, and merely indicates the relative size of the change
in weights. The power of Eqs. 9 & 10 is in their simplicity: they merely demand
that we take a step in weight space that lowers the criterion function. Because this
criterion can never be negative, moreover, this rule guarantees learning will stop
(except in pathological cases). This iterative algorithm requires taking a weight vector
at iteration $m$ and updating it as:


$$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta\mathbf{w}(m), \qquad (11)$$


where m indexes the particular pattern presentation (but see also Sect. 6.8). 
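
The update of Eqs. 9-11 can be tried on a criterion simple enough to check by hand. The following is a minimal sketch, not part of the original text: the quadratic criterion, its gradient function `toy_grad`, and the values of `eta` and `steps` are all assumptions chosen for illustration.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Iterate w(m+1) = w(m) + dw(m), with dw = -eta * grad(w) (Eqs. 9-11)."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Toy criterion (an assumption for illustration):
#   J(w) = 1/2 * ((w0 - 3)^2 + (w1 + 1)^2),
# whose gradient is (w0 - 3, w1 + 1) and whose minimum lies at (3, -1).
def toy_grad(w):
    return [w[0] - 3.0, w[1] + 1.0]


w_final = gradient_descent(toy_grad, [0.0, 0.0], eta=0.1, steps=200)
```

Because this toy criterion is non-negative and convex, the iterates approach the minimum at (3, -1); for the network criterion of Eq. 8 the same update is applied, but only a local minimum can be expected.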

We now turn to the problem of evaluating Eq. 10 for a three-layer net. Consider
first the hidden-to-output weights, $w_{kj}$. Because the error is not explicitly dependent
upon $w_{kj}$, we must use the chain rule for differentiation:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}, \qquad (12)$$

where the sensitivity of unit $k$ is defined to be

$$\delta_k \equiv -\frac{\partial J}{\partial net_k}, \qquad (13)$$


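and describes how the overall error changes with the unit's activation. We differentiate
Eq. 8 and find that for such an output unit $\delta_k$ is simply:

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k}\frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k). \qquad (14)$$

The last derivative in Eq. 12 is found using Eq. 4:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j. \qquad (15)$$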


Taken together, these results give the weight update (learning rule) for the
hidden-to-output weights:

$$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j. \qquad (16)$$


The learning rule for the input-to-hidden weights is more subtle; indeed, it is the
crux of the solution to the credit assignment problem. From Eq. 10, and again using
the chain rule, we calculate

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\frac{\partial y_j}{\partial net_j}\frac{\partial net_j}{\partial w_{ji}}. \qquad (17)$$


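The first term on the right hand side requires just a bit of care:

$$\begin{aligned}
\frac{\partial J}{\partial y_j} &= \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] \\
&= -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial y_j} \\
&= -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial net_k}\frac{\partial net_k}{\partial y_j} \\
&= -\sum_{k=1}^{c}(t_k - z_k) f'(net_k) w_{kj}. \qquad (18)
\end{aligned}$$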

For the second step above we had to use the chain rule yet again. The final sum over
output units in Eq. 18 expresses how the hidden unit output, $y_j$, affects the error at
each output unit. In analogy with Eq. 13 we use Eq. 18 to define the sensitivity for
a hidden unit as:

$$\delta_j \equiv f'(net_j)\sum_{k=1}^{c} w_{kj}\delta_k. \qquad (19)$$

Equation 19 is the core of the solution to the credit assignment problem: the sensitivity
at a hidden unit is simply the sum of the individual sensitivities at the output units
weighted by the hidden-to-output weights $w_{kj}$, all multiplied by $f'(net_j)$. Thus the
learning rule for the input-to-hidden weights is:

$$\Delta w_{ji} = \eta x_i \delta_j = \eta x_i f'(net_j)\sum_{k=1}^{c} w_{kj}\delta_k. \qquad (20)$$

Equations 16 & 20, together with training protocols such as described below, give
the backpropagation algorithm, or more specifically the “backpropagation of errors”
algorithm, so-called because during training an “error” (actually, the sensitivities
$\delta_k$) must be propagated from the output layer back to the hidden layer in order to
perform the learning of the input-to-hidden weights by Eq. 20 (Fig. 6.5). At base then,
backpropagation is “just” gradient descent in layered models where the chain rule
through continuous functions allows the computation of derivatives of the criterion
function with respect to all model parameters (i.e., weights).
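
The two learning rules can be written out for a single pattern in a few lines. The sketch below is illustrative only: it assumes $f = \tanh$ as the sigmoidal transfer function (so that $f'(net) = 1 - f(net)^2$), omits bias units as in Fig. 6.4, and uses made-up weights and helper names (`forward`, `criterion`, `backprop_updates`) that do not come from the text.

```python
import math


def forward(x, W_h, W_o):
    """Feedforward pass of a d-n_H-c network (no bias units, as in Fig. 6.4)."""
    y = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_h]
    z = [math.tanh(sum(w * yj for w, yj in zip(row, y))) for row in W_o]
    return y, z


def criterion(x, t, W_h, W_o):
    """Eq. 8: J = 1/2 * sum_k (t_k - z_k)^2 for a single pattern."""
    _, z = forward(x, W_h, W_o)
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))


def backprop_updates(x, t, W_h, W_o, eta):
    """Weight changes (dW_h, dW_o) for one pattern, following Eqs. 14, 16, 19 and 20."""
    y, z = forward(x, W_h, W_o)
    # For f = tanh, f'(net) = 1 - f(net)^2, so the derivative follows from the outputs.
    delta_o = [(tk - zk) * (1.0 - zk * zk) for tk, zk in zip(t, z)]    # Eq. 14
    delta_h = [(1.0 - y[j] * y[j]) * sum(W_o[k][j] * delta_o[k] for k in range(len(W_o)))
               for j in range(len(W_h))]                                # Eq. 19
    dW_o = [[eta * dk * yj for yj in y] for dk in delta_o]              # Eq. 16
    dW_h = [[eta * dj * xi for xi in x] for dj in delta_h]              # Eq. 20
    return dW_h, dW_o


# Illustrative 2-2-1 network and pattern (all values are made up).
W_h = [[0.1, -0.2], [0.4, 0.3]]   # W_h[j][i] plays the role of w_ji
W_o = [[0.5, -0.6]]               # W_o[k][j] plays the role of w_kj
x, t = [1.0, -1.0], [1.0]
dW_h, dW_o = backprop_updates(x, t, W_h, W_o, eta=0.5)
```

A useful sanity check on any such implementation is to compare each returned change with $-\eta$ times a finite-difference estimate of $\partial J / \partial w$ obtained from `criterion`; the two should agree to within numerical error.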


Figure 6.5: The sensitivity at a hidden unit is proportional to the weighted sum of the
sensitivities at the output units: $\delta_j = f'(net_j)\sum_{k=1}^{c} w_{kj}\delta_k$. The output unit sensitivities
are thus propagated “back” to the hidden units.


These learning rules make intuitive sense. Consider first the rule for learning
weights at the output units (Eq. 16). The weight update at unit $k$ should indeed be
proportional to $(t_k - z_k)$: if we get the desired output ($z_k = t_k$), then there should
be no weight change. For the typical sigmoidal $f(\cdot)$ we shall use most often, $f'(net_k)$ is
always positive. Thus if $y_j$ and $(t_k - z_k)$ are both positive, then the actual output is
too small and the weight must be increased; indeed, the proper sign is given by the
learning rule. Finally, the weight update should be proportional to the input value; if
$y_j = 0$, then hidden unit $j$ has no effect on the output (and hence the error), and thus
changing $w_{kj}$ will not change the error on the pattern presented. A similar analysis
of Eq. 20 yields insight into the input-to-hidden weights (Problem 5).

Problem 7 asks you to show that the presence of the bias unit does not materially 
affect the above results. Further, with moderate notational and bookkeeping effort 
(Problem 11), the above learning algorithm can be generalized directly to feed-forward 
networks in which 


• input units are connected directly to output units (as well as to hidden units)
• there are more than three layers of units
• there are different nonlinearities for different layers
• each unit has its own nonlinearity
• each unit has a different learning rate.


It is a more subtle matter to incorporate learning into networks having connections
within a layer, or feedback connections from units in higher layers back to
those in lower layers. We shall consider such recurrent networks in Sect. ??.


6.3.2 Training protocols 


In broad overview, supervised training consists in presenting to the network patterns 
whose category label we know — the training set — finding the output of the net and 
adjusting the weights so as to make the actual output more like the desired or teaching 
signal. The three most useful training protocols are: stochastic, batch and on-line. 
In stochastic training (or pattern training), patterns are chosen randomly from the 
training set, and the network weights are updated for each pattern presentation. This 
method is called stochastic because the training data can be considered a random 
variable. In batch training, all patterns are presented to the network before learning 
(weight update) takes place. In virtually every case we must make several passes 
through the training data. In on-line training, each pattern is presented once and 
only once; there is no use of memory for storing the patterns.* 

A fourth protocol is learning with queries where the output of the network is used 
to select new training patterns. Such queries generally focus on points that are likely 
to give the most information to the classifier, for instance those near category decision 
boundaries (Chap. ??). While this protocol may be faster in many cases, its drawback 
is that the training samples are no longer independent, identically distributed (i.i.d.), 
being skewed instead toward sample boundaries. This, in turn, generally distorts the 
effective distributions and may or may not improve recognition accuracy (Computer 
exercise ??). 

We describe the overall amount of pattern presentation in epochs, where the number of
epochs is the number of presentations of the full training set. With other variables held
constant, the number of epochs is an indication of the relative amount of learning.† The
basic stochastic and batch protocols of backpropagation for $n$ patterns are shown in the
procedures below.


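Algorithm 1 (Stochastic backpropagation)


1 begin initialize network topology (# hidden units), $\mathbf{w}$, criterion $\theta$, $\eta$, $m \leftarrow 0$
2    do $m \leftarrow m + 1$
3       $\mathbf{x}^m \leftarrow$ randomly chosen pattern
4       $w_{ji} \leftarrow w_{ji} + \eta\delta_j x_i$; $w_{kj} \leftarrow w_{kj} + \eta\delta_k y_j$
5    until $\|\nabla J(\mathbf{w})\| < \theta$
6    return $\mathbf{w}$
7 end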
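
Algorithm 1 can be sketched in runnable form as follows. This is an illustration under several assumptions that are not in the text: $f = \tanh$, bias handled by appending a constant 1.0 to the input and hidden vectors, initial weights drawn uniformly from [-0.5, 0.5], a cap `max_presentations` on the loop, and the stopping test of line 5 implemented as "the total error changed by less than `theta` after one presentation". The toy data set (target equal to the first input component) is also made up.

```python
import math
import random


def forward(x, W_h, W_o):
    """Feedforward pass; a constant 1.0 is appended to x and to y as a bias input."""
    xa = x + [1.0]
    ya = [math.tanh(sum(w * v for w, v in zip(row, xa))) for row in W_h] + [1.0]
    z = [math.tanh(sum(w * v for w, v in zip(row, ya))) for row in W_o]
    return xa, ya, z


def total_error(patterns, W_h, W_o):
    """Sum over all patterns of the per-pattern criterion of Eq. 8."""
    J = 0.0
    for x, t in patterns:
        z = forward(x, W_h, W_o)[2]
        J += 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))
    return J


def update(x, t, W_h, W_o, eta):
    """In-place weight update for one pattern (line 4 of Algorithm 1)."""
    xa, ya, z = forward(x, W_h, W_o)
    d_o = [(tk - zk) * (1.0 - zk * zk) for tk, zk in zip(t, z)]
    d_h = [(1.0 - ya[j] ** 2) * sum(W_o[k][j] * d_o[k] for k in range(len(W_o)))
           for j in range(len(W_h))]
    for k, row in enumerate(W_o):
        for j in range(len(row)):
            row[j] += eta * d_o[k] * ya[j]
    for j, row in enumerate(W_h):
        for i in range(len(row)):
            row[i] += eta * d_h[j] * xa[i]


def stochastic_backprop(patterns, n_hidden, eta=0.1, theta=1e-4,
                        max_presentations=20000, seed=0):
    rng = random.Random(seed)
    d, c = len(patterns[0][0]), len(patterns[0][1])
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    W_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(c)]
    J_prev = total_error(patterns, W_h, W_o)
    for m in range(1, max_presentations + 1):
        x, t = rng.choice(patterns)          # line 3: randomly chosen pattern
        update(x, t, W_h, W_o, eta)          # line 4: weight update
        J = total_error(patterns, W_h, W_o)
        if abs(J_prev - J) < theta:          # line 5: assumed form of the stopping test
            break
        J_prev = J
    return W_h, W_o, m


# Toy data (made up): the target is simply the first input component.
patterns = [([1.0, 1.0], [1.0]), ([1.0, -1.0], [1.0]),
            ([-1.0, 1.0], [-1.0]), ([-1.0, -1.0], [-1.0])]
W_h, W_o, m = stochastic_backprop(patterns, n_hidden=2)
```

Because single-pattern updates are noisy, a stopping test evaluated after every presentation can fire early; setting `theta=0.0` disables it so that training runs for the full `max_presentations`.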




In the on-line version of backpropagation, line 3 of Algorithm 1 is replaced by sequential
selection of training patterns (Problem 9). Line 5 makes the algorithm end when
the change in the criterion function $J(\mathbf{w})$ is smaller than some pre-set value $\theta$. While
this is perhaps the simplest meaningful stopping criterion, others generally lead to
better performance, as we shall discuss in Sect. 6.8.14.

* Some on-line training algorithms are considered models of biological learning, where the organism
is exposed to the environment and cannot store all input patterns for multiple “presentations.”

† The notion of epoch does not apply to on-line training, where instead the number of pattern
presentations is a more appropriate measure.

In the batch version, all the training patterns are presented first and their corre- 
sponding weight updates summed; only then are the actual weights in the network 
updated. This process is iterated until some stopping criterion is met. 

So far we have considered the error on a single pattern, but in fact we want to 
consider an error defined over the entirety of patterns in the training set. With minor 
infelicities in notation we can write this total training error as the sum over the errors 
on n individual patterns: 


$$J = \sum_{p=1}^{n} J_p. \qquad (21)$$


In stochastic training, a weight update may reduce the error on the single pattern 
being presented, yet increase the error on the full training set. Given a large number 
of such individual updates, however, the total error as given in Eq. 21 decreases. 


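Algorithm 2 (Batch backpropagation)


1 begin initialize network topology (# hidden units), $\mathbf{w}$, criterion $\theta$, $\eta$, $r \leftarrow 0$
2    do $r \leftarrow r + 1$ (increment epoch)
3       $m \leftarrow 0$; $\Delta w_{ji} \leftarrow 0$; $\Delta w_{kj} \leftarrow 0$
4       do $m \leftarrow m + 1$
5          $\mathbf{x}^m \leftarrow$ select pattern
6          $\Delta w_{ji} \leftarrow \Delta w_{ji} + \eta\delta_j x_i$; $\Delta w_{kj} \leftarrow \Delta w_{kj} + \eta\delta_k y_j$
7       until $m = n$
8       $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$; $w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$
9    until $\|\nabla J(\mathbf{w})\| < \theta$
10   return $\mathbf{w}$
11 end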


In batch backpropagation, we need not select patterns randomly, since the weights
are updated only after all patterns have been presented once. We shall consider the
merits and drawbacks of each protocol in Sect. 6.8.
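
A corresponding sketch of Algorithm 2 is given below, under the same illustrative assumptions as before ($f = \tanh$, bias via an appended constant 1.0, uniform initial weights in [-0.5, 0.5], made-up toy data). The names `batch_backprop`, `max_epochs` and `history` are hypothetical, and the stopping test is again taken to be "the total error of Eq. 21 changed by less than `theta` between epochs".

```python
import math
import random


def forward(x, W_h, W_o):
    """Feedforward pass; a constant 1.0 is appended to x and to y as a bias input."""
    xa = x + [1.0]
    ya = [math.tanh(sum(w * v for w, v in zip(row, xa))) for row in W_h] + [1.0]
    z = [math.tanh(sum(w * v for w, v in zip(row, ya))) for row in W_o]
    return xa, ya, z


def batch_backprop(patterns, n_hidden, eta=0.1, theta=1e-6, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    d, c = len(patterns[0][0]), len(patterns[0][1])
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(n_hidden)]
    W_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(c)]
    history = []   # total training error J (Eq. 21), measured at the start of each epoch
    for _ in range(max_epochs):
        dW_h = [[0.0] * (d + 1) for _ in range(n_hidden)]      # line 3: reset sums
        dW_o = [[0.0] * (n_hidden + 1) for _ in range(c)]
        J = 0.0
        for x, t in patterns:          # lines 4-7: fixed order, no random selection
            xa, ya, z = forward(x, W_h, W_o)
            J += 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))
            d_o = [(tk - zk) * (1.0 - zk * zk) for tk, zk in zip(t, z)]
            d_h = [(1.0 - ya[j] ** 2) * sum(W_o[k][j] * d_o[k] for k in range(c))
                   for j in range(n_hidden)]
            for k in range(c):         # line 6: accumulate, do not yet apply
                for j in range(n_hidden + 1):
                    dW_o[k][j] += eta * d_o[k] * ya[j]
            for j in range(n_hidden):
                for i in range(d + 1):
                    dW_h[j][i] += eta * d_h[j] * xa[i]
        history.append(J)
        if len(history) > 1 and abs(history[-2] - J) < theta:   # line 9 (assumed form)
            break
        for k in range(c):             # line 8: apply the summed updates
            for j in range(n_hidden + 1):
                W_o[k][j] += dW_o[k][j]
        for j in range(n_hidden):
            for i in range(d + 1):
                W_h[j][i] += dW_h[j][i]
    return W_h, W_o, history


# Toy data (made up): the target is simply the first input component.
patterns = [([1.0, 1.0], [1.0]), ([1.0, -1.0], [1.0]),
            ([-1.0, 1.0], [-1.0]), ([-1.0, -1.0], [-1.0])]
W_h, W_o, history = batch_backprop(patterns, n_hidden=2)
```

The list `history` is the raw material for a learning curve: plotting it against the epoch index shows the total training error as training proceeds.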


6.3.3 Learning curves 


Because the weights are initialized with random values, error on the training set
is large; through learning the error becomes lower, as shown in a learning curve
(Fig. 6.6). The (per pattern) training error ultimately reaches an asymptotic value
which depends upon the Bayes error, the amount of training data and the expressive
power (e.g., the number of weights) in the network: the higher the Bayes error
and the fewer the number of such weights, the higher this asymptotic value is likely
to be (Chap. ??). Since batch backpropagation performs gradient descent in the
criterion function, the training error decreases monotonically, provided the learning
rate is sufficiently small. The average error on an independent test set is virtually
always higher than on the training set, and while it generally decreases, it can
increase or oscillate.

Figure 6.6: A learning curve shows the criterion function as a function of the amount
of training, typically indicated by the number of epochs or presentations of the full
training set. We plot the average error per pattern, i.e., $\frac{1}{n}\sum_{p=1}^{n} J_p$. The validation
error and the test (or generalization) error per pattern are virtually always higher
than the training error. In some protocols, training is stopped at the minimum of the
validation error.

Figure 6.6 also shows the average error on a validation set: patterns not used
directly for gradient descent training, and thus indirectly representative of novel
patterns yet to be classified. The validation set can be used in a stopping criterion in
both batch and stochastic protocols; gradient descent training on the training set is
stopped when a minimum is reached in the validation error (e.g., near epoch 5 in
the figure). We shall return in Chap. ?? to understand in greater depth why this
version of cross validation stopping criterion often leads to networks having improved
recognition accuracy.
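
A validation-based stopping rule might be sketched as below. The text only says that training stops at a minimum of the validation error; the `patience` parameter (how many epochs to wait without improvement before concluding that the minimum has passed) is an assumption added here, as are the callback names and the canned validation errors used to exercise the function.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    """Return (best_epoch, best_error) for the lowest validation error observed.

    train_one_epoch:  callable performing one epoch of weight updates.
    validation_error: callable returning the current average validation error.
    Training stops once `patience` epochs pass without a new minimum.
    """
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err


# Stand-in for a real network: a canned sequence of validation errors (made up)
# that falls, reaches its minimum at epoch 5, and then rises again.
canned = iter([0.90, 0.60, 0.45, 0.40, 0.38, 0.41, 0.44, 0.47, 0.50, 0.52, 0.60])
best_epoch, best_err = train_with_early_stopping(
    lambda: None, lambda: next(canned), max_epochs=11, patience=3)
```

In a real training loop one would also keep a copy of the weights from `best_epoch`, since by the time the rule fires the network has already been trained past the minimum.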


6.4 Error surfaces 


Since backpropagation is based on gradient descent in a criterion function, we can gain
understanding and intuition about the algorithm by studying error surfaces themselves,
that is, the function $J(\mathbf{w})$. Of course, such an error surface depends upon the training and
classification task; nevertheless there are some general properties of error surfaces that
seem to hold over a broad range of real-world pattern recognition problems. One
issue that concerns us is local minima; if many local minima plague the error
landscape, then it is unlikely that the network will find the global minimum. Does this
necessarily lead to poor performance? Another issue is the presence of plateaus,
regions where the error varies only slightly as a function of weights. If such plateaus
are plentiful, we can expect training according to Algorithms 1 & 2 to be slow. Since
training typically begins with small weights, the error surface in the neighborhood of
$\mathbf{w} \simeq \mathbf{0}$ will determine the general direction of descent. What can we say about the
error in this region? Most interesting real-world problems are of high dimensionality.
Are there any general properties of high dimensional error functions?

We now explore these issues in some illustrative systems.


6.4.1 Some small networks 


Consider the simplest three-layer nonlinear network, here solving a two-category problem
in one dimension; this 1-1-1 sigmoidal network (and bias) is shown in Fig. 6.7.
The data shown are linearly separable, and the optimal decision boundary (a point
somewhat below $x_1 = 0$) separates the two categories. During learning, the weights
descend to the global minimum, and the problem is solved.


Figure 6.7: Six one-dimensional patterns (three in each of two classes) are to be
learned by a 1-1-1 network with sigmoidal hidden and output units (and bias). The
error surface as a function of $w_1$ and $w_2$ is also shown (for the case where the bias
weights have their final values). The network starts with random weights, and through
(stochastic) training descends to the global minimum in error, as shown by the trajectory.
Note especially that a low error solution exists, which in fact leads to a decision
boundary separating the training points into their two categories.


Here the error surface has a single (global) minimum, which yields the decision
point separating the patterns of the two categories. Different plateaus in the surface
correspond roughly to different numbers of patterns properly classified; the maximum
number of such misclassified patterns is three in this example. The plateau regions,
where weight change does not lead to a change in error, here correspond to sets of
weights that lead to roughly the same decision point in the input space. Thus as $w_1$
increases and $w_2$ becomes more negative, the surface shows that the error does not
change, a result that can be informally confirmed by looking at the network itself.

Now consider the same network applied to another, harder, one-dimensional prob- 
lem — one that is not linearly separable (Fig. 6.8). First, note that overall the error 
surface is slightly higher than in Fig. 6.7 because even the best solution attainable 
with this network leads to one pattern being misclassified. As before, the different 
plateaus in error correspond to different numbers of training patterns properly learned. 
However, one must not confuse the (squared) error measure with classification error 
(cf. Chap. ??, Fig. ??). For instance here there are two general ways to misclassify 
exactly two patterns, but these have different errors. Incidentally, a 1-3-1 network 
(but not a 1-2-1 network) can solve this problem (Computer exercise 3). 

From these very simple examples, where the correspondences among weight values,
decision boundary and error are manifest, we can see how the error of the global
minimum is lower when the problem can be solved and that there are plateaus
corresponding to sets of weights that lead to nearly the same decision boundary.
Furthermore, the surface near $\mathbf{w} \simeq \mathbf{0}$ (the traditional region for starting learning) has high
error and happens in this case to have a large slope; if the starting point had differed
somewhat, the network would descend to the same final weight values.

Figure 6.8: As in Fig. 6.7, except here the patterns are not linearly separable; the
error surface is slightly higher than in that figure.


6.4.2 XOR 


A somewhat more complicated problem is the XOR problem we have already consid- 
ered. Figure ?? shows several two-dimensional slices through the nine-dimensional 
weight space of the 2-2-1 sigmoidal network (with bias). The slices shown include a 
global minimum in the error. 

Notice first that the error varies a bit more gradually as a function of a single 
weight than does the error in the networks solving the problems in Figs. 6.7 & 6.8. 
This is because in a large network any single weight has on average a smaller relative 
contribution to the output. Ridges, valleys and a variety of other shapes can all 
be seen in the surface. Several local minima in the high-dimensional weight space 
exist, which here correspond to solutions that classify three (but not four) patterns. 
Although it is hard to show it graphically, the error surface is invariant with respect 
to certain discrete permutations. For instance, if the labels on the two hidden units 
are exchanged (and the weight values changed appropriately), the shape of the error 
surface is unaffected (Problem ??).
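
The permutation invariance is easy to confirm numerically. The sketch below assumes a 2-2-1 network with $f = \tanh$ and bias weights stored as the last entry of each weight row; the nine weight values are arbitrary and chosen only for illustration.

```python
import math


def net_output(x, W_h, w_o):
    """Output of a 2-2-1 tanh network; the last entry of each weight row is the bias weight."""
    xa = x + [1.0]
    y = [math.tanh(sum(w * v for w, v in zip(row, xa))) for row in W_h]
    return math.tanh(sum(w * v for w, v in zip(w_o, y + [1.0])))


# Arbitrary illustrative weights (nine in total, as for the 2-2-1 net with bias).
W_h = [[0.8, -1.2, 0.3], [-0.5, 0.9, -0.7]]
w_o = [1.1, -0.6, 0.2]

# Exchange the labels of the two hidden units: swap their input weight rows
# and the corresponding hidden-to-output weights (the output bias stays put).
W_h_swapped = [W_h[1], W_h[0]]
w_o_swapped = [w_o[1], w_o[0], w_o[2]]

xor_inputs = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
outputs = [net_output(x, W_h, w_o) for x in xor_inputs]
outputs_swapped = [net_output(x, W_h_swapped, w_o_swapped) for x in xor_inputs]
```

Since the two networks produce the same output for every input, they have the same error on any training set; the two weight vectors are distinct points in weight space with identical $J(\mathbf{w})$.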


6.4.3 Larger networks 


Alas, the intuition we gain from considering error surfaces for small networks gives only
hints of what is going on in large networks, and at times can be quite misleading.
Figure 6.10 shows a network with many weights solving a complicated high-dimensional
two-category pattern classification problem. Here, the error varies quite gradually as
a single weight is changed, though we can get troughs, valleys, canyons, and a host of
shapes.

Figure 6.9: Two-dimensional slices through the nine-dimensional error surface after
extensive training for a 2-2-1 network solving the XOR problem.

Whereas in low dimensional spaces local minima can be plentiful, in high dimen- 
sion, the problem of local minima is different: the high-dimensional space may afford 
more ways (dimensions) for the system to “get around” a barrier or local maximum 
during learning. In networks with many superfluous weights (i.e., more than are 
needed to learn the training set), one is less likely to get into local minima. However, 
networks with an unnecessarily large number of weights are undesirable because of 
the dangers of overfitting, as we shall see in Sect. 6.11. 


6.4.4 How important are multiple minima? 


The possibility of the presence of multiple local minima is one reason that we resort to 
iterative gradient descent — analytic methods are highly unlikely to find a single global 
minimum, especially in high-dimensional weight spaces. In computational practice, we 
do not want our network to be caught in a local minimum having high training error 
since this usually indicates that key features of the problem have not been learned by 
the network. In such cases it is traditional to re-initialize the weights and train again, 
possibly also altering other parameters in the net (Sect. 6.8). 


In many problems, convergence to a non-global minimum is acceptable, if the
error is nevertheless fairly low. Furthermore, common stopping criteria demand that
training terminate even before the minimum is reached and thus it is not essential
that the network be converging toward the global minimum for acceptable performance
(Sect. 6.8.14).

Figure 6.10: A network with xxx weights trained on data from a complicated pattern
recognition problem xxx.


6.5 Backpropagation as feature mapping 


Since the hidden-to-output layer leads to a linear discriminant, the novel computa- 
tional power provided by multilayer neural nets can be attributed to the nonlinear 
warping of the input to the representation at the hidden units. Let us consider this 
transformation, again with the help of the XOR problem. 

Figure 6.11 shows a three-layer net addressing the XOR problem. For any input
pattern in the $x_1$-$x_2$ space, we can show the corresponding output of the two hidden
units in the $y_1$-$y_2$ space. With small initial weights, the net activation of each
hidden unit is small, and thus the linear portion of their transfer function is used.
Such a linear transformation from $\mathbf{x}$ to $\mathbf{y}$ leaves the patterns linearly inseparable
(Problem 1). However, as learning progresses and the input-to-hidden weights increase
in magnitude, the nonlinearities of the hidden units warp and distort the mapping
from input to the hidden unit space. The linear decision boundary at the end of
learning found by the hidden-to-output weights is shown by the straight dashed line;
the nonlinearly separable problem at the inputs is transformed into a linearly separable
one at the hidden units.
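
The end result of such a mapping can be illustrated with a hand-set network. This sketch does not use learned weights: the weight values are chosen by hand for illustration, and hard threshold (sign) units stand in for the saturated sigmoids reached late in learning. Both choices are assumptions made here to keep the arithmetic checkable.

```python
def sign(v):
    """Hard threshold standing in for a saturated sigmoid."""
    return 1.0 if v >= 0 else -1.0


def hidden_representation(x1, x2):
    """Map an input pattern to the y1-y2 space using hand-chosen weights."""
    y1 = sign(x1 + x2 + 0.5)
    y2 = sign(x1 + x2 - 1.5)
    return y1, y2


def output(y1, y2):
    """Linear discriminant in the hidden-unit space, followed by a threshold."""
    return sign(0.7 * y1 - 0.4 * y2 - 1.0)


patterns = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
hidden = [hidden_representation(x1, x2) for x1, x2 in patterns]
outputs = [output(y1, y2) for y1, y2 in hidden]
```

The two patterns with unlike inputs both map to the single hidden point (1, -1), while the other two map to (-1, -1) and (1, 1); the line $0.7 y_1 - 0.4 y_2 - 1 = 0$ separates them, which no line in the original input space can do.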

We can illustrate such distortion in the three-bit parity problem, where the output
is +1 if the number of 1s in the input is odd, and -1 otherwise; this is a generalization
of the XOR or two-bit parity problem (Fig. 6.12). As before, early in learning the
hidden units operate in their linear range and thus the representation after the hidden
units remains linearly inseparable: the patterns from the two categories lie at
alternating vertexes of a cube. After learning, when the weights have become larger,
the nonlinearities of the hidden units are expressed and the patterns have been moved
so that they are linearly separable, as shown.

Figure 6.11: A 2-2-1 backpropagation network (with bias) and the four patterns of the
XOR problem are shown at the top. The middle figure shows the outputs of the hidden
units for each of the four patterns; these outputs move across the $y_1$-$y_2$ space as the
full network learns. In this space, early in training (epoch 1) the two categories are
not linearly separable. As the input-to-hidden weights learn, the categories become
linearly separable. Also shown (by the dashed line) is the linear decision boundary
determined by the hidden-to-output weights at the end of learning; indeed the
patterns of the two classes are separated by this boundary. The bottom graph shows
the learning curves: the error on individual patterns and the total error as a function
of epoch. While the error on each individual pattern does not decrease monotonically,
the total training error does decrease monotonically.

Figure 6.12: A 3-3-1 backpropagation network (plus bias) can indeed solve the three-bit
parity problem. Shown are the representation of the eight patterns at the hidden units
($y_1$-$y_2$-$y_3$ space) as the system learns and the (planar) decision boundary found by
the hidden-to-output weights at the end of learning. The patterns of the two classes
are separated by this plane. The learning curve shows the error on individual patterns
and the total error as a function of epoch.

Figure 6.13 shows a two-dimensional two-category problem and the pattern
representations in a 2-2-1 and in a 2-3-1 network of sigmoidal hidden units. Note that
in the two-hidden unit net, the categories are separated somewhat, but not enough
for error-free classification; the expressive power of the net is not sufficiently high.
In contrast, the three-hidden unit net can separate the patterns. In general, given
sufficiently many hidden units in a sigmoidal network, any set of different patterns
can be learned in this way.


6.5.1 Representations at the hidden layer — weights 


In addition to focusing on the transformation of patterns, we can also consider the
representation of learned weights themselves. Since the hidden-to-output weights
merely lead to a linear discriminant, it is instead the input-to-hidden weights that
are most instructive. In particular, such weights at a single hidden unit describe the
input pattern that leads to maximum activation of that hidden unit, analogous to
a “matched filter.” Because the hidden unit transfer functions are nonlinear, the
correspondence with classical methods such as matched filters (and principal components,
Sect. ??) is not exact; nevertheless it is often convenient to think of the hidden
units as finding feature groupings useful for the linear classifier implemented by the
hidden-to-output layer weights.

Figure 6.13: Seven patterns from a two-dimensional two-category nonlinearly separable
classification problem are shown at the bottom. The figure at the top left shows the
hidden unit representations of the patterns in a 2-2-1 sigmoidal network (with bias)
fully trained to the global error minimum; the linear boundary implemented by the
hidden-to-output weights is also shown. Note that the categories are almost linearly
separable in this $y_1$-$y_2$ space, but one training point is misclassified. At the top
right is the analogous hidden unit representation for a fully trained 2-3-1 network
(with bias). Because of the higher dimension of the hidden layer representation, the
categories are now linearly separable; indeed the learned hidden-to-output weights
implement a plane that separates the categories.

Figure 6.14 shows the input-to-hidden weights (displayed as patterns) for a simple
task of character recognition. Note that one hidden unit seems “tuned” for a pair of
horizontal bars while the other seems tuned to a single lower bar. Both of these feature
groupings are useful building blocks for the patterns presented. In complex, high-dimensional
problems, however, the pattern of learned weights may not appear to be simply related
to the features we suspect are appropriate for the task. This could be because we
may be mistaken about which are the true, relevant feature groupings; nonlinear
interactions between features may be significant in a problem (and such interactions
are not manifest in the patterns of weights at a single hidden unit); or the network
may have too many weights (degrees of freedom), and thus the feature selectivity is
low.

It is generally much harder to represent the hidden-to-output layer weights in
terms of input features. Not only do the hidden units themselves already encode a
somewhat abstract pattern, there is moreover no natural ordering of the hidden units.
Together with the fact that the outputs of hidden units are nonlinearly related to the
inputs, this makes analyzing hidden-to-output weights somewhat problematic. Often
the best we can do is list the patterns of input weights for hidden units that have
strong connections to the output unit in question (Computer exercise 9).


6.6 Backpropagation, Bayes theory and probability 


While multilayer neural networks may appear to be somewhat ad hoc, we now show
that when trained via backpropagation on a sum-squared error criterion they form a
least squares fit to the Bayes discriminant functions.

Figure 6.14: The top images represent patterns from a large training set used to train
a 64-2-3 sigmoidal network for classifying three characters. The bottom figures show
the input-to-hidden weights (represented as patterns) at the two hidden units after
training. Note that these learned weights indeed describe feature groupings useful for
the classification task. In large networks, such patterns of learned weights may be
difficult to interpret in this way.


6.6.1 Bayes discriminants and neural networks 


As we saw in Chap. ?? Sect. ??, the LMS algorithm computed the approximation to
the Bayes discriminant function for two-layer nets. We now generalize this result in
two ways: to multiple categories and to nonlinear functions implemented by three-layer
neural networks. We use the network of Fig. 6.4 and let $g_k(\mathbf{x}; \mathbf{w})$ be the output
of the $k$th output unit, the discriminant function corresponding to category $\omega_k$.
Recall first Bayes’ formula,

$$P(\omega_k|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_k)P(\omega_k)}{\sum_{i=1}^{c} P(\mathbf{x}|\omega_i)P(\omega_i)} = \frac{P(\mathbf{x}, \omega_k)}{P(\mathbf{x})}, \qquad (22)$$

and the Bayes decision for any pattern $\mathbf{x}$: choose the category $\omega_k$ having the largest
discriminant function $g_k(\mathbf{x}) = P(\omega_k|\mathbf{x})$.

Suppose we train a network having $c$ output units with a target signal according
to:

$$t_k(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in \omega_k \\ 0 & \text{otherwise.} \end{cases} \qquad (23)$$


(In practice, teaching values of ±1 are to be preferred, as we shall see in Sect. 6.8; we
use the values 0-1 in this derivation for computational simplicity.) The contribution
to the criterion function based on a single output unit $k$ for a finite number of training
samples $\mathbf{x}$ is:

$$J(\mathbf{w}) = \sum_{\mathbf{x}} [g_k(\mathbf{x}; \mathbf{w}) - t_k]^2, \qquad (24)$$

where $n$ is the total number of training patterns, $n_k$ of which are in $\omega_k$. In the limit
of infinite data we can use Bayes’ formula (Eq. 22) to express Eq. 24 as (Problem 17):

$$\lim_{n \to \infty} \frac{1}{n} J(\mathbf{w}) \equiv \tilde{J}(\mathbf{w}). \qquad (25)$$

The backpropagation rule changes weights to minimize the left hand side of Eq. 25,
and thus it minimizes

$$\int [g_k(\mathbf{x}; \mathbf{w}) - P(\omega_k|\mathbf{x})]^2 p(\mathbf{x})\,d\mathbf{x}. \qquad (26)$$
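
The intermediate steps between Eq. 24 and Eq. 26 are left to Problem 17. One way to fill them in is sketched below, assuming the 0-1 targets of Eq. 23, writing densities with a lower-case $p$ as in Eq. 26, abbreviating $g_k(\mathbf{x}; \mathbf{w})$ as $g_k$ inside the integrals, and using $\omega_{i \neq k}$ as a shorthand (introduced here) for "any category other than $\omega_k$".

```latex
\begin{aligned}
\frac{1}{n} J(\mathbf{w})
  &= \frac{n_k}{n}\,\frac{1}{n_k}\sum_{\mathbf{x}\in\omega_k}[g_k(\mathbf{x};\mathbf{w}) - 1]^2
   + \frac{n - n_k}{n}\,\frac{1}{n - n_k}\sum_{\mathbf{x}\notin\omega_k}[g_k(\mathbf{x};\mathbf{w}) - 0]^2 \\
  &\xrightarrow{\,n\to\infty\,}
     P(\omega_k)\int [g_k - 1]^2\, p(\mathbf{x}|\omega_k)\,d\mathbf{x}
   + P(\omega_{i\neq k})\int g_k^2\, p(\mathbf{x}|\omega_{i\neq k})\,d\mathbf{x} \\
  &= \int g_k^2\, p(\mathbf{x})\,d\mathbf{x}
   - 2\int g_k\, p(\mathbf{x},\omega_k)\,d\mathbf{x}
   + \int p(\mathbf{x},\omega_k)\,d\mathbf{x} \\
  &= \int [g_k - P(\omega_k|\mathbf{x})]^2\, p(\mathbf{x})\,d\mathbf{x}
   + \int P(\omega_k|\mathbf{x})\,[1 - P(\omega_k|\mathbf{x})]\,p(\mathbf{x})\,d\mathbf{x}.
\end{aligned}
```

The third line uses $P(\omega_k)p(\mathbf{x}|\omega_k) = p(\mathbf{x},\omega_k)$ and $P(\omega_{i\neq k})p(\mathbf{x}|\omega_{i\neq k}) = p(\mathbf{x}) - p(\mathbf{x},\omega_k)$; the fourth completes the square using $p(\mathbf{x},\omega_k) = P(\omega_k|\mathbf{x})p(\mathbf{x})$. Only the first integral in the last line depends on $\mathbf{w}$, so minimizing the criterion amounts to minimizing Eq. 26.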


Since this is true for each category $\omega_k$ ($k = 1, 2, \ldots, c$), backpropagation minimizes the
sum (Problem 22):

$$\sum_{k=1}^{c} \int [g_k(\mathbf{x}; \mathbf{w}) - P(\omega_k|\mathbf{x})]^2 p(\mathbf{x})\,d\mathbf{x}. \qquad (27)$$

Thus in the limit of infinite data the outputs of the trained network will approximate
(in a least-squares sense) the true a posteriori probabilities, that is, the output units
represent the a posteriori probabilities,

$$g_k(\mathbf{x}; \mathbf{w}) \simeq P(\omega_k|\mathbf{x}). \qquad (28)$$


Figure 6.15 illustrates the development of the learned outputs toward the Bayes
discriminants as the amount of training data and the expressive power of the net
increase.

We must be cautious in interpreting these results, however. A key assumption
underlying the argument is that the network can indeed represent the functions $P(\omega_k|\mathbf{x})$;
with insufficient hidden units, this will not be true (Problem ??). Moreover, fitting
the discriminant function does not guarantee the optimal classification boundaries are
found, just as we saw in Chap. ??.


6.6.2 Outputs as probabilities 


Figure 6.15: As a network is trained via backpropagation (under the assumptions
given in the text), its outputs more closely approximate posterior probabilities. The
figure shows the outputs of a 1-3-2 and a 1-8-2 sigmoidal network after backpropagation
training with $n = 10$ and $n = 1000$ points from two categories. Note especially
the excellent agreement between the large net’s outputs and the Bayesian discriminant
functions in the regions of high $p(x)$.

In the previous subsection we saw one way to make the $c$ output units of a trained net
represent probabilities by training with 0-1 target values. Indeed, given infinite
amounts of training data (and assuming the net can express the discriminants, does
not fall into an undesirable local minimum, etc.), the outputs will represent
probabilities. If, however, these conditions do not hold (in particular, if we have only
a finite amount of training data), then the outputs will not represent probabilities;
for instance there is no guarantee that they will sum to 1.0. In fact, if the sum of the
network outputs differs significantly from 1.0 within some range of the input space, it
is an indication that the network is not accurately modeling the posteriors. This, in
turn, may suggest changing the network topology, number of hidden units, or other
aspects of the net (Sect. 6.8).


One approach toward approximating probabilities is to choose the output unit
nonlinearity to be exponential rather than sigmoidal, $f(net_k) \propto e^{net_k}$, and for
each pattern normalize the outputs to sum to 1.0,

$$z_k = \frac{e^{net_k}}{\sum_{m=1}^{c} e^{net_m}},$$

and to train using 0-1 target signals. This is the softmax method: a smoothed or
continuous version of a winner-take-all nonlinearity in which the maximum output is
transformed to 1.0, and all others reduced to 0.0. The softmax output finds theoretical
justification if for each category $\omega_k$ the hidden unit representations $\mathbf{y}$ can be assumed
to come from an exponential distribution (Problem 20, Computer exercise 10).
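
The normalization above can be sketched in a few lines of Python. This is only an illustration: the function name `softmax` and the subtraction of the largest activation (a common numerical safeguard that leaves the ratios unchanged) are assumptions of the sketch, not part of the text.

```python
import math

def softmax(net):
    """Map net activations to outputs that are positive and sum to 1.0."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    shift = max(net)
    exps = [math.exp(v - shift) for v in net]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical net activations for a three-category problem.
z = softmax([2.0, 1.0, -1.0])
```

As the largest activation pulls away from the others, the corresponding output approaches 1.0 and the rest approach 0.0, which is the winner-take-all limit mentioned above.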


A neural network classifier trained in this manner approximates the posterior
probabilities $P(\omega_i|\mathbf{x})$, whether or not the data was sampled from unequal priors $P(\omega_i)$.
If such a trained network is to be used on problems in which the priors have been
changed, it is a simple matter to rescale each network output, $g_i(\mathbf{x}) = P(\omega_i|\mathbf{x})$, by the
ratio of such priors (Computer exercise 11).
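
A minimal sketch of such a rescaling follows. It assumes the network outputs are good estimates of the posteriors, multiplies each by the ratio of new to training prior, and then renormalizes so the results again sum to 1.0; the renormalization step and all names are assumptions of this sketch.

```python
def rescale_for_new_priors(posteriors, train_priors, new_priors):
    """Adjust estimated posteriors P(w_i|x) when the category priors change."""
    scaled = [p * new / old
              for p, old, new in zip(posteriors, train_priors, new_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize to sum to 1.0

# Hypothetical two-category case: trained with equal priors, deployed where
# the first category is nine times as common as the second.
adjusted = rescale_for_new_priors([0.5, 0.5], [0.5, 0.5], [0.9, 0.1])
```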


6.7 *Related statistical techniques


While the graphical, topological representation of networks is useful and a guide to 
intuition, we must not forget that the underlying mathematics of the feedforward 
operation is governed by Eq. 6. A number of statistical methods bear similarities 
to that equation. For instance, projection pursuit regression (or simply projection 
pursuit) implements 


$$z = \sum_{j=1}^{j_{max}} w_j f_j(\mathbf{v}_j^t \mathbf{x} + v_{j0}) + w_0. \qquad (30)$$


Here each $\mathbf{v}_j$ and $v_{j0}$ together define the projection of the input $\mathbf{x}$ onto one of $j_{max}$
different d-dimensional hyperplanes. These projections are transformed by nonlinear
functions $f_j(\cdot)$ whose values are then linearly combined at the output; traditionally,
sigmoidal or Gaussian functions are used. The $f_j(\cdot)$ have been called ridge functions
because for peaked $f_j(\cdot)$, one obtains ridges in two dimensions. Equation 30 implements
a mapping to a scalar function z; in a c-category classification problem there
would be c such outputs. In computational practice, the parameters are learned in
groups minimizing an LMS error, for instance first the components of $\mathbf{v}_1$ and $v_{10}$, then
$\mathbf{v}_2$ and $v_{20}$ up to $\mathbf{v}_{j_{max}}$ and $v_{j_{max}0}$; then the $w_j$ and $w_0$, iterating until convergence.

Such models are related to the three-layer networks we have seen in that the $\mathbf{v}_j$
and $v_{j0}$ are analogous to the input-to-hidden weights at a hidden unit and the effective
output unit is linear. The class of functions $f_j(\cdot)$ at such hidden units is more general
and has more free parameters than do sigmoids. Moreover, such a model can have
an output much larger than 1.0, as might be needed in a general regression task. In
the classification tasks we have considered, a saturating output, such as a sigmoid, is
more appropriate.

Another technique related to multilayer neural nets is generalized additive models, 
which implement 


$$z = f\left(\sum_{i=1}^{d} f_i(x_i) + w_0\right), \qquad (31)$$


where again $f(\cdot)$ is often chosen to be a sigmoid, and the functions $f_i(\cdot)$ operating on
the input features are nonlinear, and sometimes chosen to be sigmoidal. Such models
are trained by iteratively adjusting parameters of the component nonlinearities $f_i(\cdot)$.
Indeed, the basic three-layer neural networks of Sect. 6.2 implement a special case of
generalized additive models (Problem 24), though the training methods differ.

An extremely flexible technique having many adjustable parameters is multivariate
adaptive regression splines (MARS). In this technique, localized spline functions
(polynomials adjusted to ensure continuous derivatives) are used in the initial processing.
Here the output is the weighted sum of M products of splines:


$$z = \sum_{k=1}^{M} w_k \prod_{r=1}^{r_k} \phi_{kr}(x_{q(k,r)}) + w_0, \qquad (32)$$


where the kth basis function is the product of $r_k$ one-dimensional spline functions $\phi_{kr}$;
$w_0$ is a scalar offset. The splines depend on the input values $x_q$, such as the feature
component of an input, where the index is labeled q(k,r). Naturally, in a c-category
task, there would be one such output for each category.




In broad overview, training in MARS begins by fitting the data with a spline
function along each feature dimension in turn. The spline that best fits the data (in
a sum squared error sense) is retained. This is the r = 1 term in Eq. 32. Next, each
of the other feature dimensions is considered, one by one. For each such dimension,
candidate splines are selected based on the data fit using the product of that spline
with the one previously selected, thereby giving the product $r = 1 \to 2$. The best
such second spline is retained, thereby giving the r = 2 term. In this way, splines are
added incrementally up to some value $r_k$, where some desired quality of fit is achieved.
The weights $w_k$ are learned using an LMS criterion.

For several reasons, multilayer neural nets have all but supplanted projection pur- 
suit, MARS and earlier related techniques in practical pattern recognition research. 
Backpropagation is simpler than learning in projection pursuit and MARS, especially
when the number of training patterns and the dimension are large; heuristic information
can be incorporated more simply into nets (Sect. 6.8.12); nets admit a variety of
simplification or regularization methods (Sect. 6.11) that have no direct counterpart
in those earlier methods. It is, moreover, usually simpler to refine a trained neural
net using additional training data than it is to modify classifiers based on projection 
pursuit or MARS. 


6.8 Practical techniques for improving backpropagation


When creating a multilayer neural network classifier, the designer must make two ma- 
jor types of decision: selection of the architecture and selection of parameters (though 
the distinction is not always crisp or important). Our goal here is to give a princi- 
pled basis for making such choices based on learning speed and optimal recognition 
performance. In practice, while parameter adjustment is problem dependent, several
rules of thumb emerge from an analysis of networks.


6.8.1 Transfer function 


There are a number of desirable properties for $f(\cdot)$, but we must not lose sight of the
fact that backpropagation will work with virtually any transfer function, given that 
a few simple conditions such as continuity of f and its derivative are met. In any 
particular classification problem we may have a good reason for selecting a particular 
transfer function. For instance, if we have prior information that the distributions 
arise from a mixture of Gaussians, then Gaussian transfer functions are appropriate 
(Sect. ??). 

When not guided by such problem dependent information, what general properties
might we seek in $f(\cdot)$? First, of course, $f(\cdot)$ must be nonlinear; otherwise the
three-layer network provides no computational power above that of a two-layer net
(Problem 1). A second desirable property is that $f(\cdot)$ saturate, i.e., have some maximum
and minimum output value. This will keep the weights and activations bounded,
and thus keep training time limited. (This property is less desirable in networks used
for regression, since there we may seek output values greater than any saturation
level selected before training.) A third property is continuity and smoothness, i.e.,
that $f(\cdot)$ and $f'(\cdot)$ be defined throughout the range of their argument. Recall that
the fact that we could take a derivative of $f(\cdot)$ was crucial in the derivation of the
backpropagation learning rule. The rule would not, therefore, work with the threshold
or sign function of Eq. 3. Backpropagation can be made to work with piecewise linear
transfer functions, but with added complexity and few benefits.

Monotonicity is another convenient (but non-essential) property for $f(\cdot)$: we
might wish the derivative to have the same sign throughout the range of the argument,
e.g., $f'(\cdot) > 0$. If f is not monotonic, additional (and undesirable) local extrema in
the error surface may be introduced (Computer Exercise ??). Non-monotonic
transfer functions such as radial basis functions can be used if proper care is taken
(Sect. 6.10.1). Another desirable property is linearity for small values of net, which will
enable the system to implement a linear model if adequate for low error. A property
that might occasionally be of importance is computational simplicity: we seek a
function whose value and derivative can be easily computed.

We mention in passing that polynomial classifiers use transfer functions of the
form $x_1, x_2, \ldots, x_d, x_1^2, x_2^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d$, and so forth (all terms up to some
limit); training is via gradient descent too. One drawback is that the outputs of the
hidden units ($\varphi$ functions) can become extremely large even for realistic problems
(Problem 29, Computer exercise ??). Instead, standard neural networks employ the
same nonlinearity at each hidden unit.

One class of functions that has all the above properties is the sigmoid, such as a
hyperbolic tangent. The sigmoid is smooth, differentiable, nonlinear, and saturating.
It also admits a linear model if the network weights are small. A minor benefit is that
the derivative $f'(\cdot)$ can be easily expressed in terms of $f(\cdot)$ itself (Problem 10). One
last benefit of the sigmoid is that it maximizes information transmission for features
that are normally distributed (Problem 25).

A hidden layer of sigmoidal units affords a distributed or global representation 
of the input. That is, any particular input x is likely to yield activity throughout 
several hidden units. In contrast, if the hidden units have transfer functions that have 
significant response only for inputs within a small range, then an input x generally 
leads to fewer hidden units being active — a local representation. (Nearest neighbor 
classifiers employ local representations, of course.) It is often found in practice that 
when there are few training points, distributed representations are superior because 
more of the data influences the posteriors at any given input region (Computer exercise 
14). 

The sigmoid is the most widely used transfer function for the above reasons, and 
in much of the following we shall employ sigmoids. 


6.8.2 Parameters for the sigmoid 


Given that we will use the sigmoidal form, there remain a number of parameters 
to set. It is best to keep the function centered on zero and anti-symmetric, i.e.,
$f(-net) = -f(net)$, rather than one whose value is always positive. Together with
the data preprocessing described in Sect. 6.8.3, anti-symmetric sigmoids speed learning
by eliminating the need to learn the mean values of the training data. Thus, sigmoid 
functions of the form 


$$f(net) = a \tanh(b\, net) = a\,\frac{e^{b\,net} - e^{-b\,net}}{e^{b\,net} + e^{-b\,net}} = \frac{2a}{1 + e^{-2b\,net}} - a \qquad (33)$$

work well. The overall range and slope are not important, since it is their relationship
to parameters such as the learning rate and magnitudes of the inputs and targets
that determines learning times (Problem 23). For convenience, though, we choose
a = 1.716 and b = 2/3 in Eq. 33, values which ensure that $f'(0) = ab \approx 1.14$ is close to
unity, that $f(\pm 1) \approx \pm 1$ so the linear range is $-1 < net < +1$, and that the extrema of
the second derivative, which lie where $\tanh(b\, net) = \pm 1/\sqrt{3}$, occur roughly
at $net \approx \pm 1$ (Fig. 6.16).


Figure 6.16: A useful transfer function f(net) is an anti-symmetric sigmoid. For the
parameters given in the text, f(net) is nearly linear in the range $-1 < net < +1$ and
its second derivative, $f''(net)$, has extrema near $net \approx \pm 1$.
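
The properties claimed for f(net) = a tanh(b net) with a = 1.716 and b = 2/3 can be checked numerically. The sketch below is illustrative only; the function names and the coarse grid search for the extremum of the second derivative are assumptions of the sketch.

```python
import math

A = 1.716        # overall amplitude a
B = 2.0 / 3.0    # slope parameter b

def f(net):
    """Anti-symmetric sigmoid f(net) = a tanh(b net)."""
    return A * math.tanh(B * net)

def f_prime(net):
    """First derivative written in terms of f itself: (b/a)(a^2 - f^2)."""
    return (B / A) * (A * A - f(net) ** 2)

def f_second(net):
    """Second derivative: -(2b/a) f(net) f'(net)."""
    return -(2.0 * B / A) * f(net) * f_prime(net)

# Locate the positive extremum of the second derivative on a grid 0.00 .. 4.00.
grid = [i / 100.0 for i in range(0, 401)]
net_at_extremum = max(grid, key=lambda v: abs(f_second(v)))
```

With these parameters f(1) is very nearly 1, the slope at the origin is the product ab, and the outputs never exceed the saturation value a in magnitude.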


6.8.3 Scaling input 


Suppose we were using a two-input network to classify fish based on the features of 
mass (measured in grams) and length (measured in meters). Such a representation 
will have serious drawbacks for a neural network classifier: the numerical value of 
the mass will be orders of magnitude larger than that for length. During training the 
network will adjust weights from the “mass” input unit far more than for the “length” 
input; indeed, the error will hardly depend upon the tiny length values. If, however,
the same physical information were presented but with mass measured in kilograms
and length in millimeters, the situation would be reversed. Naturally we do not want
our classifier to prefer one of these features over the other, since they differ solely in 
the arbitrary representation. The difficulty arises even for features having the same 
units but differing overall magnitude, of course, for instance if a fish’s length and its 
fin thickness were both measured in millimeters. 

In order to avoid such difficulties, the input patterns should be shifted so that the
average (over the training set) of each feature is zero. Moreover, the full data set
should then be scaled to have the same variance in each feature component, here
chosen to be 1.0 for reasons that will be clear in Sect. 6.8.8. That is, we standardize the
training patterns. This data standardization is done once, before actual network
training, and thus represents a small one-time computational burden (Problem 27,
Computer exercise 15). Standardization can only be done for stochastic and batch
learning protocols, but not on-line protocols where the full data set is never available
at any one time.
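
A minimal sketch of this standardization, assuming the whole training set fits in memory and that no feature is constant (otherwise its standard deviation would be zero). The fish measurements are invented for illustration.

```python
import math

def standardize(patterns):
    """Shift each feature to zero mean and scale it to unit variance."""
    n, d = len(patterns), len(patterns[0])
    means = [sum(x[i] for x in patterns) / n for i in range(d)]
    stds = [math.sqrt(sum((x[i] - means[i]) ** 2 for x in patterns) / n)
            for i in range(d)]
    scaled = [[(x[i] - means[i]) / stds[i] for i in range(d)] for x in patterns]
    # The means and standard deviations are returned so that test patterns
    # can later be transformed in exactly the same way.
    return scaled, means, stds

# Hypothetical fish: mass in grams, length in meters.
fish = [[1200.0, 0.40], [3500.0, 0.75], [2100.0, 0.55], [2800.0, 0.62]]
scaled, means, stds = standardize(fish)
```

After the transformation both features have comparable magnitude, so neither dominates the weight updates merely because of its units.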


6.8.4 Target values 


For pattern recognition, we typically train with the pattern and its category label,
and thus we use a one-of-c representation for the target vector. Since the output units
saturate at $\pm 1.716$, we might naively feel that the target values should be those values;
however, that would present a difficulty. For any finite value of $net_k$, the output would
be less than the saturation values, and thus there would be error. Full training would
never terminate as weights would become extremely large as $net_k$ would be driven to
$\pm\infty$.

This difficulty can be avoided by using teaching values of +1 for the target category
and -1 for the non-target categories. For instance, in a four-category problem
if the pattern is in category $\omega_3$, the following target vector should be used:
$\mathbf{t} = (-1, -1, +1, -1)$. Of course, this target representation yields efficient learning for
categorization, but the outputs here do not represent posterior probabilities (Sect. 6.6.2).
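
The target encoding described here is easy to generate. The helper below is a sketch; its name and the use of 1-based category indices are assumptions.

```python
def target_vector(category, c):
    """One-of-c target: +1 for the target category (1-based), -1 elsewhere."""
    return [1.0 if k == category else -1.0 for k in range(1, c + 1)]

# A pattern from the third of four categories.
t = target_vector(3, 4)
```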


6.8.5 Training with noise 


When the training set is small, one can generate virtual or surrogate training pat- 
terns and use them as if they were normal training patterns sampled from the source 
distributions. In the absence of problem-specific information, a natural assumption 
is that such surrogate patterns should be made by adding d-dimensional Gaussian 
noise to true training points. In particular, for the standardized inputs described in 
Sect. 6.8.3, the variance of the added noise should be less than 1.0 (e.g., 0.1) and the 
category label left unchanged. This method of training with noise can be used with 
virtually every classification method, though it generally does not improve accuracy 
for highly local classifiers such as ones based on the nearest neighbor (Problem 30). 
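
One way to generate such surrogate patterns is sketched below, assuming standardized inputs and independent Gaussian noise in each feature. The number of copies per pattern, the fixed seed, and all names are illustrative choices, not prescriptions from the text.

```python
import random

def add_noise(patterns, labels, copies=5, variance=0.1, seed=0):
    """Return surrogate patterns made by adding Gaussian noise; labels are kept."""
    rng = random.Random(seed)
    sigma = variance ** 0.5          # random.gauss expects a standard deviation
    noisy, noisy_labels = [], []
    for x, label in zip(patterns, labels):
        for _ in range(copies):
            noisy.append([xi + rng.gauss(0.0, sigma) for xi in x])
            noisy_labels.append(label)
    return noisy, noisy_labels

# Two hypothetical standardized patterns from categories 1 and 2.
patterns = [[0.0, 1.0], [-1.0, 0.5]]
labels = [1, 2]
noisy, noisy_labels = add_noise(patterns, labels, copies=3)
```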


6.8.6 Manufacturing data 


If we have knowledge about the sources of variation among patterns (for instance due 
to geometrical invariances), we can “manufacture” training data that conveys more 
information than does the method of training with uncorrelated noise (Sec. 6.8.5). 
For instance, in an optical character recognition problem, an input image may be pre- 
sented rotated by various amounts. Hence during training we can take any particular 
training pattern and rotate its image to “manufacture” a training point that may be 
representative of a much larger training set. Likewise, we might scale a pattern, per- 
form simple image processing to simulate a bold face character, and so on. If we have 
information about the range of expected rotation angles, or the variation in thickness 
of the character strokes, we should manufacture the data accordingly. 

While this method bears formal equivalence to incorporating prior information in 
a maximum likelihood approach, it is usually much simpler to implement, since we 
need only the (forward) model for generating patterns. As with training with noise, 
manufacturing data can be used with a wide range of pattern recognition methods. 
A drawback is that the memory requirements may be large and overall training slow.


6.8.7 Number of hidden units


While the number of input units and output units are dictated by the dimensionality 
of the input vectors and the number of categories, respectively, the number of hidden 
units is not simply related to such obvious properties of the classification problem. 
The number of hidden units, $n_H$, governs the expressive power of the net, and
thus the complexity of the decision boundary. If the patterns are well separated or
linearly separable, then few hidden units are needed; conversely, if the patterns are
drawn from complicated densities that are highly interspersed, then more hidden units are
needed. Thus without further information there is no foolproof method for setting
the number of hidden units before training.


Figure 6.17 shows the training and test error on a two-category classification problem
for networks that differ solely in their number of hidden units. For large $n_H$, the
training error can become small because such networks have high expressive power and
become tuned to the particular training data. Nevertheless, in this regime, the test
error is unacceptably high, an example of overfitting we shall study again in Chap. ??.
At the other extreme of too few hidden units, the net does not have enough free parameters
to fit the training data well, and again the test error is high. We seek some
intermediate number of hidden units that will give low test error.


Figure 6.17: The error per pattern for networks fully trained but differing in the
numbers of hidden units, $n_H$. Each 2-$n_H$-1 network (with bias) was trained with
90 two-dimensional patterns from each of two categories (sampled from a mixture of
three Gaussians); thus n = 180. The minimum of the test error occurs for networks in
the range $4 \leq n_H \leq 5$, i.e., the range of weights 17 to 21. This illustrates the rule of
thumb that choosing networks with roughly n/10 weights often gives low test error.


The number of hidden units determines the total number of weights in the net 
— which we consider informally as the number of degrees of freedom — and thus 
we should not have more weights than the total number of training points, n. A 
convenient rule of thumb is to choose the number of hidden units such that the total 
number of weights in the net is roughly n/10. This seems to work well over a range 
of practical problems. A more principled method is to adjust the complexity of the 
network in response to the training data, for instance start with a “large” number of
hidden units and prune or eliminate weights, techniques we shall study in Sect. ?? and
Chap. ??.
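
The n/10 rule of thumb can be turned into a small calculation. The sketch below assumes a fully connected d-$n_H$-c network with a bias unit feeding both the hidden and the output layer, so that the total number of weights is $(d+1)n_H + (n_H+1)c$; the function names are hypothetical.

```python
def total_weights(d, n_hidden, c):
    """Weights in a fully connected d-n_hidden-c net, bias weights included."""
    return (d + 1) * n_hidden + (n_hidden + 1) * c

def hidden_units_rule_of_thumb(n, d, c):
    """Number of hidden units giving roughly n/10 weights (at least one)."""
    target = n / 10.0
    return max(1, round((target - c) / (d + 1 + c)))

# The setting of Fig. 6.17: n = 180 patterns, two inputs, one output.
n_hidden = hidden_units_rule_of_thumb(180, 2, 1)
```

For the network of Fig. 6.17 this suggests four hidden units (17 weights), at the lower end of the range in which the test error was found to be smallest.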


6.8.8 Initializing weights


Suppose we have fixed the network topology, and thus set the number of hidden 
units. We now seek to set the initial weight values in order to have fast and uniform 
learning, i.e., all weights reach their final equilibrium values at about the same time. 
One form of non-uniform learning occurs when category $\omega_i$ is learned well before $\omega_j$.
In this undesirable case, the distribution of errors differs markedly from Bayes, and
the overall error rate is typically higher than necessary. (The data standardization
described above also helps to ensure uniform learning.)

In setting weights in a given layer, we choose weights randomly from a single distribution
to help ensure uniform learning. Because data standardization gives positive
and negative values equally, on average, we want positive and negative weights as well;
thus we choose weights from a uniform distribution $-\tilde{w} < w < +\tilde{w}$, for some $\tilde{w}$ yet
to be determined. If $\tilde{w}$ is chosen too small, the net activation of a hidden unit will be
small and the linear model will be implemented. Alternatively, if $\tilde{w}$ is too large, the
hidden unit may saturate even before learning begins. Hence we set $\tilde{w}$ such that the
net activation at a hidden unit is in the range $-1 < net_j < +1$, since $net_j \approx \pm 1$ are
the limits to its linear range (Fig. 6.16).


In order to calculate $\tilde{w}$, consider a hidden unit having a fan-in of d inputs. Suppose
too that all weights have the same value $\tilde{w}$. On average, then, the net activation from
d random variables of variance 1.0 from our standardized input through such weights
will be $\tilde{w}\sqrt{d}$. As mentioned, we would like this net activation to be roughly in the
range $-1 < net < +1$. This implies that $\tilde{w} = 1/\sqrt{d}$ and thus input weights should
be chosen in the range $-1/\sqrt{d} < w_{ji} < +1/\sqrt{d}$. The same argument holds for the
hidden-to-output weights, where the fan-in is $n_H$; hidden-to-output weights should be
initialized with values chosen in the range $-1/\sqrt{n_H} < w_{kj} < +1/\sqrt{n_H}$.
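
The initialization ranges just derived can be sketched as follows. Bias weights are left out for brevity, and the layer sizes, seed, and names are illustrative assumptions.

```python
import math
import random

def init_weights(fan_in, n_units, rng):
    """Draw weights uniformly from (-1/sqrt(fan_in), +1/sqrt(fan_in))."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(n_units)]

rng = random.Random(0)
d, n_hidden, c = 16, 8, 3                  # hypothetical 16-8-3 network
w_ji = init_weights(d, n_hidden, rng)      # input-to-hidden, fan-in d
w_kj = init_weights(n_hidden, c, rng)      # hidden-to-output, fan-in n_hidden
```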


6.8.9 Learning rates 


In principle, so long as the learning rate is small enough to assure convergence, its 
value determines only the speed at which the network attains a minimum in the 
criterion function J(w), not the final weight values themselves. In practice, however, 
because networks are rarely fully trained to a training error minimum (Sect. 6.8.14), 
the learning rate can affect the quality of the final network. If some weights converge 
significantly earlier than others (non-uniform learning) then the network may not 
perform equally well throughout the full range of inputs, or equally well for patterns 
in each category. Figure 6.18 shows the effect of different learning rates on convergence 
in a single dimension. 

The optimal learning rate is the one which leads to the local error minimum in one 
learning step. A principled method of setting the learning rate comes from assuming 
the criterion function can be reasonably approximated by a quadratic which thus gives 
(Fig. 6.19):


The optimal rate is found directly to be

$$\eta_{opt} = \left(\frac{\partial^2 J}{\partial w^2}\right)^{-1}. \qquad (35)$$


Figure 6.18: Gradient descent in a one-dimensional quadratic criterion with different
learning rates. If $\eta < \eta_{opt}$, convergence is assured, but training can be needlessly
slow. If $\eta = \eta_{opt}$, a single learning step suffices to find the error minimum. If
$\eta_{opt} < \eta < 2\eta_{opt}$, the system will oscillate but nevertheless converge, but training is
needlessly slow. If $\eta > 2\eta_{opt}$, the system diverges.


Of course the maximum learning rate that will give convergence is $\eta_{max} = 2\eta_{opt}$. It
should be noted that a learning rate $\eta$ in the range $\eta_{opt} < \eta < 2\eta_{opt}$ will lead to slower
convergence (Computer exercise 8).


Figure 6.19: If the criterion function is quadratic (above), its derivative is linear (below).
The optimal learning rate $\eta_{opt}$ ensures that the weight value yielding minimum
error, $w^*$, is found in a single learning step.


Thus, for rapid and uniform learning, we should calculate the second derivative of 
the criterion function with respect to each weight and set the optimal learning rate 
separately for each weight. We shall return in Sect. ?? to calculate second derivatives 
in networks, and to alternate descent and training methods such as Quickprop that 
give fast, uniform learning. For typical problems addressed with sigmoidal networks
and parameters discussed throughout this section, it is found that a learning rate
of $\eta \approx 0.1$ is often adequate as a first choice; it should be lowered if the criterion function
diverges, or raised if learning seems unduly slow.
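
The regimes of the learning rate can be reproduced on a one-dimensional quadratic criterion. The sketch assumes $J(w) = \frac{1}{2} k (w - w^*)^2$, whose second derivative is the constant k, so that the optimal rate is 1/k; the particular numbers are arbitrary.

```python
def descend(eta, curvature=4.0, w_star=1.0, w0=5.0, steps=20):
    """Gradient descent on J(w) = 0.5 * curvature * (w - w_star)**2."""
    w = w0
    for _ in range(steps):
        w -= eta * curvature * (w - w_star)   # dJ/dw = curvature * (w - w_star)
    return w

eta_opt = 1.0 / 4.0                         # inverse of the second derivative

one_step = descend(eta_opt, steps=1)        # reaches w_star in a single step
slow = descend(0.5 * eta_opt)               # converges monotonically
oscillating = descend(1.5 * eta_opt)        # overshoots each step but converges
diverging = descend(2.5 * eta_opt)          # beyond 2 * eta_opt: diverges
```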


6.8.10 Momentum 


Error surfaces often have plateaus — regions in which the slope dJ(w)/dw is very 
small — for instance because of “too many” weights. Momentum — loosely based 
on the notion from physics that moving objects tend to keep moving unless acted 
upon by outside forces — allows the network to learn more quickly when plateaus 
in the error surface exist. The approach is to alter the learning rule in stochastic
backpropagation to include some fraction $\alpha$ of the previous weight update:

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \underbrace{\Delta\mathbf{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha\,\Delta\mathbf{w}(m-1)}_{\text{momentum}} \qquad (36)$$

Of course, $\alpha$ must be less than 1.0 for stability; typical values are $\alpha \approx 0.9$. It must
be stressed that momentum rarely changes the final solution, but merely allows it to
be found more rapidly. Momentum provides another benefit: effectively “averaging
out” stochastic variations in weight updates during stochastic learning and thereby
speeding learning, even far from error plateaus (Fig. 6.20).


Figure 6.20: The incorporation of momentum into stochastic gradient descent by
Eq. 36 (white arrows) reduces the variation in overall gradient directions and speeds
learning, especially over plateaus in the error surface.


Algorithm 3 shows one way to incorporate momentum into gradient descent.


Algorithm 3 (Stochastic backpropagation with momentum)

1 begin initialize topology (# hidden units), w, criterion, α (< 1), θ, η, m ← 0, b_ji ← 0, b_kj ← 0
2     do m ← m + 1
3         x^m ← randomly chosen pattern
4         b_ji ← η δ_j x_i + α b_ji;  b_kj ← η δ_k y_j + α b_kj
5         w_ji ← w_ji + b_ji;  w_kj ← w_kj + b_kj
6     until ‖∇J(w)‖ < θ
7     return w
8 end
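
The effect of the momentum term can be seen on a toy problem. The sketch below is not a full backpropagation implementation: it assumes the gradient of the criterion is available as a function, uses the full gradient rather than single randomly chosen patterns, and applies the same update rule as lines 4 and 5 of Algorithm 3. The toy criterion is nearly flat along one weight, mimicking a plateau.

```python
def momentum_descent(grad, w, eta=0.1, alpha=0.9, theta=1e-6, max_steps=100000):
    """Gradient descent with momentum; returns final weights and steps used."""
    b = [0.0] * len(w)                # previous updates (the b terms of Algorithm 3)
    steps = 0
    while steps < max_steps:
        g = grad(w)
        if sum(gi * gi for gi in g) ** 0.5 < theta:   # stop when ||grad J|| < theta
            break
        b = [-eta * gi + alpha * bi for gi, bi in zip(g, b)]
        w = [wi + bi for wi, bi in zip(w, b)]
        steps += 1
    return w, steps

# Toy criterion J(w) = 0.5 * (w0**2 + 0.01 * w1**2): very shallow along w1.
def toy_grad(w):
    return [w[0], 0.01 * w[1]]

w_mom, steps_mom = momentum_descent(toy_grad, [1.0, 1.0], alpha=0.9)
w_plain, steps_plain = momentum_descent(toy_grad, [1.0, 1.0], alpha=0.0)
```

Both runs end near the same minimum at the origin; the run with momentum needs far fewer steps, in line with the remark that momentum speeds learning without changing the final solution.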


6.8.11 Weight decay 


One method of simplifying a network and avoiding overfitting is to impose a heuristic 
that the weights should be small. There is no principled reason why such a method 
of “weight decay” should always lead to improved network performance (indeed there 
are occasional cases where it leads to degraded performance) but it is found in most 
cases that it helps. The basic approach is to start with a network with “too many” 
weights (or hidden units) and “decay” all weights during training. Small weights favor 
models that are more nearly linear (Problems 1 & 41). One of the reasons weight 
decay is so popular is its simplicity. After each weight update every weight is simply 
“decayed” or shrunk according to: 


$$w^{new} = w^{old}(1 - \epsilon), \qquad (37)$$


where $0 < \epsilon < 1$. In this way, weights that are not needed for reducing the criterion
function become smaller and smaller, possibly to such a small value that they can be 
eliminated altogether. Those weights that are needed to solve the problem cannot de- 
cay indefinitely. In weight decay, then, the system achieves a balance between pattern 
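error (Eq. 60) and some measure of overall weight. It can be shown (Problem 43) that
the weight decay is equivalent, to first order in $\epsilon$, to gradient descent in a new
effective error or criterion function:

$$J_{ef} = J(\mathbf{w}) + \frac{\epsilon}{2\eta}\, \mathbf{w}^t \mathbf{w}, \qquad (38)$$

since a gradient step of size $\eta$ on the added term changes each weight $w$ by $-\epsilon w$.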


The second term on the right hand side of Eq. 38 preferentially penalizes a single large
weight. Another version of weight decay includes a decay parameter that depends
upon the value of the weight itself, and this tends to distribute the penalty throughout
the network:

$$\epsilon_{mr} \propto \frac{1}{(1 + w_{mr}^2)^2}. \qquad (39)$$

We shall discuss principled methods for setting $\epsilon$, and see how weight decay is an
instance of a more general regularization procedure in Chap. ??.
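
The link between shrinking the weights and penalizing their squared magnitude can be checked numerically. The sketch below compares one gradient step followed by the decay $w \leftarrow w(1 - \epsilon)$ with one gradient step on the criterion plus a penalty $(\epsilon / 2\eta)\, \mathbf{w}^t \mathbf{w}$. The penalty coefficient is derived here under that assumption, and the two updates agree only up to a term of order $\eta\epsilon$ times the gradient.

```python
def decay_step(w, grad, eta, eps):
    """Gradient step followed by shrinking every weight by the factor (1 - eps)."""
    return [(wi - eta * gi) * (1.0 - eps) for wi, gi in zip(w, grad)]

def penalized_step(w, grad, eta, eps):
    """Gradient step on J(w) + (eps / (2 * eta)) * sum of squared weights."""
    lam = eps / (2.0 * eta)
    # The gradient of lam * w_i**2 with respect to w_i is 2 * lam * w_i.
    return [wi - eta * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]

# Hypothetical weights and criterion gradient.
w, g, eta, eps = [2.0, -1.0], [0.5, 2.0], 0.1, 0.01
after_decay = decay_step(w, g, eta, eps)
after_penalty = penalized_step(w, g, eta, eps)
```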


6.8.12 Hints 


Often we have insufficient training data for adequate classification accuracy and we 
would like to add information or constraints to improve the network. The approach 
of learning with hints is to add output units for addressing an ancillary problem, one 
related to the classification problem at hand. The expanded network is trained on the 
classification problem of interest and the ancillary one, possibly simultaneously. For 
instance, suppose we seek to train a network to classify c phonemes based on some 
acoustic input. In a standard neural network we would have c output units. In learning 
with hints, we might add two ancillary output units, one which represents vowels and 
the other consonants. During training, the target vector must be lengthened to include 
components for the hint outputs. During classification the hint units are not used; 
they and their hidden-to-output weights can be discarded (Fig. 6.21).


Figure 6.21: In learning with hints, the output layer of a standard network having
c units (discriminant functions) is augmented with hint units. During training, the 
target vectors are also augmented with signals for the hint units. In this way the 
input-to-hidden weights learn improved feature groupings. During classification the 
hint units are not used, and thus they and their hidden-to-output weights are removed 
from the trained network. 


The benefit provided by hints is in improved feature selection. So long as the hints 
are related to the classification problem at hand, the feature groupings useful for the 
hint task are likely to aid category learning. For instance, the feature groupings useful 
for distinguishing vowel sounds from consonants in general are likely to be useful 
for distinguishing the /b/ from /oo/ or the /g/ from /ii/ categories in particular. 
Alternatively, one can train just the hint units in order to develop improved hidden 
unit representations (Computer exercise 16). 

Learning with hints illustrates another benefit of neural networks: hints are more 
easily incorporated into neural networks than into classifiers based on other algo- 
rithms, such as the nearest-neighbor or MARS. 


6.8.13 On-line, stochastic or batch training? 


Each of the three leading training protocols described in Sect. 6.3.2 has strengths and 
drawbacks. On-line learning is to be used when the amount of training data is so
large, or memory costs are so high, that storing the data is prohibitive. Most
practical neural network classification problems are addressed instead with batch or 
stochastic protocols. 

Batch learning is typically slower than stochastic learning. To see this, imagine
a training set of 50 patterns that consists of 10 copies each of five patterns
($\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^5$). In batch learning, the presentations of the duplicates of $\mathbf{x}^1$ provide as
much information as a single presentation of $\mathbf{x}^1$ in the stochastic case. For example,
suppose in the batch case the learning rate is set optimally. The same weight change
can be achieved with just a single presentation of each of the five different patterns in
the batch case (with learning rate correspondingly greater). Of course, true problems
do not have exact duplicates of individual patterns; nevertheless, true data sets are
generally highly redundant, and the above analysis holds.

For most applications, especially ones employing large redundant training sets,
stochastic training is hence to be preferred. Batch training admits some second-order
techniques that cannot be easily incorporated into stochastic learning protocols
and in some problems should be preferred, as we shall see in Sect. ??.


6.8.14 Stopped training 


In three-layer networks having many weights, excessive training can lead to poor 
generalization, as the net implements a complex decision boundary “tuned” to the 
specific training data rather than the general properties of the underlying distribu- 
tions. In training the two-layer networks of Chap. ??, we could train as long as we like 
without fear that it would degrade final recognition accuracy because the complexity 
of the decision boundary is not changed — it is always simply a hyperplane. This 
example shows that the general phenomenon should be called “overfitting,” and not 
“overtraining.” 

Because the network weights are initialized with small values, the units operate in 
their linear range and the full network implements linear discriminants. As training 
progresses, the nonlinearities of the units are expressed and the decision boundary 
warps. Qualitatively speaking, stopping the training before gradient descent is 
complete can help avoid overfitting. In practice, the elementary criterion of stopping when 
the error function decreases less than some preset value (e.g., line ?? in Algorithm ??) 
does not lead reliably to accurate classifiers, as it is hard to know beforehand what an 
appropriate threshold $\theta$ should be. A far more effective method is to stop training 
when the error on a separate validation set reaches a minimum (Fig. ??). We shall 
explore the theory underlying this version of cross validation in Chap. ??. We note 
in passing that weight decay is equivalent to a form of stopped training (Fig. 6.22). 

Figure 6.22: When weights are initialized with small magnitudes, stopped training 
is equivalent to a form of weight decay since the final weights are smaller than they 
would be after extensive training. 
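
Stopping at the validation minimum can be sketched as follows. This is a minimal illustration under stated assumptions, not the text's algorithm: the function names, the `patience` parameter, and the one-weight toy errors are all hypothetical. Training starts from a small weight and descends the training error; the weights kept are those with the lowest validation error seen so far.

```python
def train_with_early_stopping(w, train_step, validation_error,
                              max_epochs=200, patience=5):
    """Return the weights with the lowest validation error seen during training.

    `patience` (a hypothetical knob, not from the text) is how many epochs we
    keep training without improvement before concluding the minimum has passed.
    """
    best_w, best_err, epochs_since_best = w, validation_error(w), 0
    for _ in range(max_epochs):
        w = train_step(w)                      # one epoch of gradient descent
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, epochs_since_best = w, err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                          # validation error has stopped falling
    return best_w, best_err

# Toy one-weight problem (assumed): training error (w - 2)^2, validation error (w - 1.5)^2.
train_step = lambda w: w - 0.1 * 2.0 * (w - 2.0)
validation_error = lambda w: (w - 1.5) ** 2

w_stop, err_stop = train_with_early_stopping(0.0, train_step, validation_error)
print(w_stop, err_stop)
```

In this toy run the returned weight is smaller in magnitude than the training-error minimum at 2, which is the connection to weight decay noted in Fig. 6.22.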


6.8.15 How many hidden layers? 


The backpropagation algorithm applies equally well to networks with three, four, or 
more layers, so long as the units in such layers have differentiable transfer functions. 
Since, as we have seen, three layers suffice to implement any arbitrary function, it 
would take special problem conditions or requirements to recommend the use of more 
than three layers. 

One possible such requirement is translation, rotation or other distortion invari- 
ances. If the input layer represents the pixel image in an optical character recognition 
problem, we generally want such a recognizer to be invariant with respect to such 
transformations. It is easier for a three-layer net to accept small translations than to 
accept large ones. In practice, then, networks with several hidden layers distribute 
the invariance task throughout the net. Naturally, the weight initialization, learning 
rate, and data preprocessing arguments apply to these networks too. The Neocognitron 
network architecture (Sec. 6.10.7) has many layers for just this reason (though it is 
trained by a method somewhat different from backpropagation). It has been found 
empirically that networks with multiple hidden layers are more prone to getting caught 
in undesirable local minima. 

In the absence of a problem-specific reason for multiple hidden layers, then, it is 
simplest to proceed using just a single hidden layer. 


6.8.16 Criterion function 


The squared error criterion of Eq. 8 is the most common training criterion because 
it is simple to compute, non-negative, and simplifies the proofs of some theorems. 
Nevertheless, other training criteria occasionally have benefits. One popular alternate 
is the cross entropy which for n patterns is of the form: 


$$J_{ce}(\mathbf{w}) = \sum_{m=1}^{n} \sum_{k=1}^{c} t_{mk} \ln(t_{mk}/z_{mk}), \tag{40}$$

where $t_{mk}$ and $z_{mk}$ are the target and the actual output of unit $k$ for pattern $m$. Of 
course, this criterion function requires both the teaching and the output values in the 
range (0,1). 
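
As an illustration of Eq. 40, the sketch below evaluates the cross entropy for a small set of made-up targets and outputs. The function name and the data are assumptions; the guard for $t = 0$ is also an addition (it uses the limit $t \ln t \to 0$), since the text requires values strictly inside (0,1).

```python
import math

def cross_entropy(targets, outputs):
    """Eq. 40: sum over patterns m and output units k of t_mk * ln(t_mk / z_mk)."""
    total = 0.0
    for t_m, z_m in zip(targets, outputs):
        for t, z in zip(t_m, z_m):
            if t > 0.0:                      # t * ln(t / z) tends to 0 as t -> 0
                total += t * math.log(t / z)
    return total

# Two patterns, two output units; values are arbitrary and lie in (0, 1).
targets = [[0.9, 0.1], [0.2, 0.8]]
matched = [[0.9, 0.1], [0.2, 0.8]]           # outputs equal to the targets
uninformative = [[0.5, 0.5], [0.5, 0.5]]     # outputs that ignore the targets

j_matched = cross_entropy(targets, matched)
j_uninformative = cross_entropy(targets, uninformative)
print(j_matched, j_uninformative)
```

The criterion is zero when the outputs reproduce the targets and positive for the uninformative outputs in this example.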

Regularization and overfitting avoidance are generally achieved by penalizing the 
complexity of models or networks (Chap. ??). In regularization, the training error and the 
complexity penalty should be of related functional forms. Thus if the pattern error is 
the sum of squares, then a reasonable network penalty would be squared length of the 
total weight vector (Eq. 38). Likewise, if the model penalty is some description length 
(measured in bits), then a pattern error based on cross entropy would be appropriate 
(Eq. 40). 

Yet another criterion function is based on the Minkowski error: 


$$J_{Mink}(\mathbf{w}) = \sum_{m=1}^{n} \sum_{k=1}^{c} |z_{mk}(\mathbf{x}) - t_{mk}(\mathbf{x})|^{R}, \tag{41}$$

much as we saw in Chap. ??. It is a straightforward matter to derive the 
backpropagation rule for this error (Problem ??). While in general the rule is a bit more 
complex than for the ($R = 2$) sum squared error we have considered (since it includes 
a Sgn[·] function), the Minkowski error for $1 < R < 2$ reduces the influence of long 
tails in the distributions — tails that may be quite far from the category decision 
boundaries. As such, the designer can adjust the “locality” of the classifier indirectly 
through choice of R; the smaller the R, the more local the classifier. 
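
The effect of $R$ on far-away patterns can be seen from the derivative of a single term of Eq. 41 with respect to the output, $R\,|z - t|^{R-1}\,\mathrm{Sgn}[z - t]$. The sketch below is an illustration only; the function name and the two sample errors (1 and 10) are assumptions chosen to compare a nearby pattern with one far out in a tail.

```python
def minkowski_term_gradient(z, t, R):
    """d/dz of |z - t|^R, i.e. R * |z - t|^(R - 1) * Sgn[z - t]."""
    diff = z - t
    sgn = (diff > 0) - (diff < 0)            # Sgn[.] as +1, 0 or -1
    return R * abs(diff) ** (R - 1) * sgn

def tail_to_near_ratio(R, near_error=1.0, far_error=10.0):
    """How much harder a far-away pattern pulls on the output than a nearby one."""
    return (minkowski_term_gradient(far_error, 0.0, R)
            / minkowski_term_gradient(near_error, 0.0, R))

ratio_squared = tail_to_near_ratio(2.0)      # sum squared error
ratio_smaller_R = tail_to_near_ratio(1.5)    # Minkowski error with R = 1.5
print(ratio_squared, ratio_smaller_R)
```

With $R = 2$ the distant pattern contributes ten times the gradient of the near one; with $R = 1.5$ the ratio drops to about 3.2, so the tails have less influence.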


Most of the heuristics described in this section can be used alone or in combination 
with others. While they may interact in unexpected ways, all have found use in 
important pattern recognition problems and classifier designers should have experience 
with all of them. 


6.9 *Second-order methods 


We have used a second-order analysis of the error in order to determine the optimal 
learning rate. One can use second-order information more fully in other ways. 


6.9.1 Hessian matrix 


We derived the first-order derivatives of a sum-squared-error criterion function in 
three-layer networks, summarized in Eqs. 16 & 20. We now turn to second-order 
derivatives, which find use in rapid learning methods, as well as some pruning or 
regularization algorithms. For our criterion function, 


$$J(\mathbf{w}) = \frac{1}{2} \sum_{m=1}^{n} (t_m - z_m)^2, \tag{42}$$


where $t_m$ and $z_m$ are the target and output signals, and $n$ the total number of training 
patterns. The elements in the Hessian matrix are 

$$\frac{\partial^2 J(\mathbf{w})}{\partial w_{ji}\,\partial w_{lk}} = \sum_{m=1}^{n} \frac{\partial z_m}{\partial w_{ji}} \frac{\partial z_m}{\partial w_{lk}} + \sum_{m=1}^{n} (z_m - t_m) \frac{\partial^2 z_m}{\partial w_{ji}\,\partial w_{lk}}, \tag{43}$$

where we have used the subscripts to refer to any weight in the network; thus $i$, $j$, $l$ 
and $k$ could all take on values that describe input-to-hidden weights, or that describe 
hidden-to-output weights, or mixtures. Of course the Hessian matrix is symmetric. 
The second term in Eq. 43 is often neglected; this guarantees that 
the resulting approximation is positive definite. 

The second term is of order $O(\|\mathbf{t} - \mathbf{z}\|)$; using Fisher’s method of scoring we set this 
term to zero. This gives the expected value, a positive definite matrix, thereby 
guaranteeing that gradient descent will progress. In this so-called Levenberg-Marquardt or 
outer product approximation our Hessian reduces to: 

$$\frac{\partial^2 J(\mathbf{w})}{\partial w_{ji}\,\partial w_{lk}} \simeq \sum_{m=1}^{n} \frac{\partial z_m}{\partial w_{ji}} \frac{\partial z_m}{\partial w_{lk}}.$$

The full exact calculation of the Hessian matrix for a three-layer network such as 
we have considered is (Problem 31): 

If the two weights are both in the hidden-to-output layer: 


outout (44) 


If the two weights are both in the input-to-hidden layer: 
inin (45) 
If the weights are in different layers: 


inout (46) 


6.9.2 Newton’s method 


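Expanding the criterion function to second order about the current weights $\mathbf{w}$ gives 
the change in error for a weight change $\Delta\mathbf{w}$: 

$$\Delta J(\mathbf{w}) = J(\mathbf{w} + \Delta\mathbf{w}) - J(\mathbf{w}) \simeq \left(\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right)^t \Delta\mathbf{w} + \frac{1}{2} \Delta\mathbf{w}^t \mathbf{H} \Delta\mathbf{w}, \tag{47}$$

where $\mathbf{H}$ is the Hessian matrix. We differentiate Eq. 47 with respect to $\Delta\mathbf{w}$ and find 
that $\Delta J(\mathbf{w})$ is minimized for 

$$\left(\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right) + \mathbf{H}\Delta\mathbf{w} = 0, \tag{48}$$

and thus the optimum change in weights can be expressed as 

$$\Delta\mathbf{w} = -\mathbf{H}^{-1} \left(\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right). \tag{49}$$

Thus, if we have an estimate for the optimal weights $\mathbf{w}(m)$, we can get an improved 
estimate using the weight change given by Eq. 49, i.e., 

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta\mathbf{w} = \mathbf{w}(m) - \mathbf{H}^{-1}(m) \left(\frac{\partial J(\mathbf{w}(m))}{\partial \mathbf{w}}\right). \tag{50}$$

Thus in this Newton’s algorithm, we iteratively recompute $\mathbf{w}$. 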
Alas, the computation of the Hessian can be expensive, and there is no guarantee 
that the Hessian is nonsingular. 
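
A single Newton update (Eqs. 49 and 50) can be sketched for a two-weight problem as below. This is an illustration under stated assumptions: the quadratic criterion, its Hessian values and the helper names are made up, and the 2-by-2 system is solved directly rather than by forming $\mathbf{H}^{-1}$. The singularity check reflects the caveat above.

```python
def solve_2x2(H, g):
    """Solve H x = g for a 2x2 matrix H by Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Hessian is (nearly) singular; Newton step undefined")
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def newton_step(w, gradient, hessian):
    """Eq. 50: w(m+1) = w(m) - H^{-1} dJ/dw, evaluated at w(m)."""
    dw = solve_2x2(hessian(w), gradient(w))
    return [w[0] - dw[0], w[1] - dw[1]]

# Assumed toy criterion: J(w) = 1/2 (w - w_star)^t H (w - w_star).
H = [[2.0, 1.0], [1.0, 3.0]]
w_star = [1.0, -2.0]

def gradient(w):
    d = [w[0] - w_star[0], w[1] - w_star[1]]
    return [H[0][0] * d[0] + H[0][1] * d[1], H[1][0] * d[0] + H[1][1] * d[1]]

def hessian(w):
    return H                                   # constant for a quadratic criterion

w_new = newton_step([5.0, 5.0], gradient, hessian)
print(w_new)
```

For a criterion that is exactly quadratic, as assumed here, one step lands on the minimum; for a network error surface the expansion in Eq. 47 is only local and the step must be iterated.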


6.9.3 Quickprop 


The simplest method for using second-order information to increase training speed is 
the Quickprop algorithm. In this method, the weights are assumed to be independent, 
and the descent is optimized separately for each. The error surface is assumed to be 
quadratic (i.e., a parabola) and the coefficients for the parabola are determined by 
two successive evaluations of J(w) and dJ(w)/dw. The single weight w is then moved 
to the computed minimum of the parabola (Fig. 6.23). It can be shown (Problem 34) 
that this approach leads to the following weight update rule: 


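$$\Delta w(m+1) = \frac{\left.\frac{dJ}{dw}\right|_{m}}{\left.\frac{dJ}{dw}\right|_{m-1} - \left.\frac{dJ}{dw}\right|_{m}} \, \Delta w(m). \tag{51}$$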


If the third- and higher-order terms in the error are non-negligible, or if the assumption 
of weight independence does not hold, then the computed error minimum will not 
equal the true minimum, and further weight updates will be needed. When a number 
of obvious heuristics are imposed (to reduce the effects of estimation error when 
the surface is nearly flat, or when the step actually increases the error), the method can 
be significantly faster than standard backpropagation. Another benefit is that each 
weight has, in effect, its own learning rate, and thus weights tend to converge at 
roughly the same time, thereby reducing problems due to nonuniform learning. 

Figure 6.23: The quickprop weight update takes the error derivatives at two points 
separated by a known amount, and by Eq. 51 computes its next weight value. If the 
error can be fully expressed as a second-order function, then the weight update leads 
to the weight ($w^*$) giving minimum error. 
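
The update of Eq. 51 can be sketched for a single weight as follows. The parabola $J(w) = (w - 3)^2$ and the two starting weights are assumptions chosen for illustration, and none of the safeguarding heuristics mentioned above are included (for instance, the step is undefined when the two derivatives are equal).

```python
def quickprop_step(dw_prev, slope_prev, slope_curr):
    """Eq. 51: next weight change from two successive derivatives dJ/dw."""
    return slope_curr / (slope_prev - slope_curr) * dw_prev

def dJ_dw(w):
    """Derivative of the assumed toy criterion J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_prev, w_curr = 0.0, 1.0                      # two successive weight values
dw = quickprop_step(w_curr - w_prev, dJ_dw(w_prev), dJ_dw(w_curr))
w_next = w_curr + dw
print(w_next)
```

Because the toy criterion is exactly a parabola, the single update moves the weight to its minimum at $w = 3$.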


6.9.4 Conjugate gradient descent 


Another fast learning method is conjugate gradient descent, which employs a series 
of line searches in weight or parameter space. One picks the first descent direction 
(for instance, determined by the gradient) and moves along that direction until the 
minimum in error is reached. The second descent direction is then computed: this 
direction — the “conjugate direction” — is the one along which the gradient does not 
change its direction, but merely its magnitude during the next descent. Descent along 
this direction will not “spoil” the contribution from the previous descent iterations 
(Fig. ??). 


Figure 6.24: Conjugate gradient descent in weight space employs a sequence of 
line searches. If $\Delta\mathbf{w}(1)$ is the first descent direction, the second direction obeys 
$\Delta\mathbf{w}^t(1)\mathbf{H}\Delta\mathbf{w}(2) = 0$. Note especially that along this second descent, the gradient 
changes only in magnitude, not direction; as such the second descent does not “spoil” 
the contribution due to the previous line search. In the case where the Hessian is 
proportional to the identity matrix (right, $\mathbf{H} \propto \mathbf{I}$), the directions of the line searches are orthogonal. 


More specifically, let $\Delta\mathbf{w}(m-1)$ represent the direction of a line search on 
step $m - 1$. (Note especially that this is not an overall magnitude of change, which is 
determined by the line search.) We demand that the subsequent direction, $\Delta\mathbf{w}(m)$, 
obey 


$$\Delta\mathbf{w}^t(m-1)\,\mathbf{H}\,\Delta\mathbf{w}(m) = 0, \tag{52}$$


where H is the Hessian matrix. Pairs of descent directions that obey Eq. 52 are 
called “conjugate.” If the Hessian is proportional to the identity matrix, then such 
directions are orthogonal in weight space. Conjugate gradient requires batch training, 
since the Hessian matrix is defined over the full training set. 

The descent direction on iteration $m$ is in the direction of the negative gradient plus a 
component along the previous descent direction: 


$$\Delta\mathbf{w}(m) = -\nabla J(\mathbf{w}(m)) + \beta_m \Delta\mathbf{w}(m-1), \tag{53}$$


and the relative proportions of these contributions are governed by $\beta_m$. This proportion 
can be derived by ensuring that the descent direction on iteration $m$ does not spoil 
that from direction $m - 1$, and indeed all earlier directions. It is generally calculated 
in one of two ways. The first formula (Fletcher-Reeves) is 


$$\beta_m = \frac{[\nabla J(\mathbf{w}(m))]^t \nabla J(\mathbf{w}(m))}{[\nabla J(\mathbf{w}(m-1))]^t \nabla J(\mathbf{w}(m-1))}. \tag{54}$$


A slightly preferable formula (Polak-Ribiere), which is more robust for non-quadratic error 
functions, is 


$$\beta_m = \frac{[\nabla J(\mathbf{w}(m))]^t [\nabla J(\mathbf{w}(m)) - \nabla J(\mathbf{w}(m-1))]}{[\nabla J(\mathbf{w}(m-1))]^t \nabla J(\mathbf{w}(m-1))}. \tag{55}$$


Equations 53 & 36 show that the conjugate gradient descent algorithm is analogous 
to calculating a “smart” momentum, where $\beta$ plays the role of a momentum. If the 
error function is quadratic, then the convergence of conjugate gradient descent is 
guaranteed when the number of iterations equals the total number of weights. 


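Example 1: Conjugate gradient descent 


Consider finding the minimum of a simple quadratic criterion function centered 
on the origin of weight space, $J(\mathbf{w}) = 0.2 w_1^2 + w_2^2 = \mathbf{w}^t \mathbf{H} \mathbf{w}$, where 
$\mathbf{H} = \begin{pmatrix} 0.2 & 0 \\ 0 & 1 \end{pmatrix}$ (by simple differentiation the Hessian of $J$ is $2\mathbf{H}$, which defines 
the same conjugate directions). We start descent at a 
randomly selected position, which happens to be $\mathbf{w}(0) = \begin{pmatrix} -8 \\ -4 \end{pmatrix}$, as shown in the figure. 
The first descent direction is determined by a simple gradient, which is easily found to 
be $-\nabla J(\mathbf{w}(0)) = -\begin{pmatrix} 0.4\, w_1(0) \\ 2\, w_2(0) \end{pmatrix} = \begin{pmatrix} 3.2 \\ 8 \end{pmatrix}$. In typical complex problems in high dimensions, 
the minimum along this direction is found using a line search; in this simple case the 
minimum can be found by calculus. We let $s$ represent the distance along the first 
descent direction, and find its value for the minimum of $J(\mathbf{w})$ according to: 

$$\frac{d}{ds}\left[\left(\begin{pmatrix} -8 \\ -4 \end{pmatrix} + s \begin{pmatrix} 3.2 \\ 8 \end{pmatrix}\right)^t \begin{pmatrix} 0.2 & 0 \\ 0 & 1 \end{pmatrix} \left(\begin{pmatrix} -8 \\ -4 \end{pmatrix} + s \begin{pmatrix} 3.2 \\ 8 \end{pmatrix}\right)\right] = 0,$$

which has solution $s = 0.562$. Therefore the minimum along this direction is 

$$\mathbf{w}(1) = \mathbf{w}(0) + 0.562\,(-\nabla J(\mathbf{w}(0))) = \begin{pmatrix} -8 \\ -4 \end{pmatrix} + 0.562 \begin{pmatrix} 3.2 \\ 8 \end{pmatrix} = \begin{pmatrix} -6.202 \\ 0.496 \end{pmatrix}.$$

Now we turn to the use of conjugate gradients for the next descent. The simple 
gradient evaluated at $\mathbf{w}(1)$ is 

$$-\nabla J(\mathbf{w}(1)) = -\begin{pmatrix} 0.4\, w_1(1) \\ 2\, w_2(1) \end{pmatrix} = \begin{pmatrix} 2.48 \\ -0.99 \end{pmatrix}.$$

(It is easy to verify that this direction, shown as a black arrow in the figure, does not 
point toward the global minimum at $\mathbf{w} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$.) We use the Fletcher-Reeves formula 
(Eq. 54) to construct the conjugate gradient direction: 

$$\beta_1 = \frac{[\nabla J(\mathbf{w}(1))]^t \nabla J(\mathbf{w}(1))}{[\nabla J(\mathbf{w}(0))]^t \nabla J(\mathbf{w}(0))} = \frac{(-2.48 \;\; 0.99)\begin{pmatrix} -2.48 \\ 0.99 \end{pmatrix}}{(-3.2 \;\; -8)\begin{pmatrix} -3.2 \\ -8 \end{pmatrix}} = \frac{7.13}{74.24} = 0.096.$$

Incidentally, for this quadratic error surface, the Polak-Ribiere formula (Eq. 55) would 
give the same value. Thus the conjugate descent direction is 

$$\Delta\mathbf{w}(1) = -\nabla J(\mathbf{w}(1)) + \beta_1 \Delta\mathbf{w}(0) = \begin{pmatrix} 2.48 \\ -0.99 \end{pmatrix} + 0.096 \begin{pmatrix} 3.2 \\ 8 \end{pmatrix} = \begin{pmatrix} 2.788 \\ -0.223 \end{pmatrix}.$$

As above, rather than perform a traditional line search, we use calculus to find the 
error minimum along this second descent direction: 

$$\frac{d}{ds}\Big[[\mathbf{w}(1) + s\Delta\mathbf{w}(1)]^t\, \mathbf{H}\, [\mathbf{w}(1) + s\Delta\mathbf{w}(1)]\Big] = \frac{d}{ds}\left[\left(\begin{pmatrix} -6.202 \\ 0.496 \end{pmatrix} + s \begin{pmatrix} 2.788 \\ -0.223 \end{pmatrix}\right)^t \begin{pmatrix} 0.2 & 0 \\ 0 & 1 \end{pmatrix} \left(\begin{pmatrix} -6.202 \\ 0.496 \end{pmatrix} + s \begin{pmatrix} 2.788 \\ -0.223 \end{pmatrix}\right)\right] = 0,$$

which has solution $s = 2.224$. This yields the next minimum to be 

$$\mathbf{w}(2) = \mathbf{w}(1) + s\Delta\mathbf{w}(1) = \begin{pmatrix} -6.202 \\ 0.496 \end{pmatrix} + 2.224 \begin{pmatrix} 2.788 \\ -0.223 \end{pmatrix} \approx \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Indeed, the conjugate gradient search finds the global minimum in this quadratic 
error function in two search steps, the number of dimensions of the space. 

Conjugate gradient descent in a quadratic error landscape, shown in contour plot 
(axes $w_1$ and $w_2$), starts at a random point $\mathbf{w}(0)$ and descends by a sequence of line searches. The first 
direction is given by the standard gradient and terminates at a minimum of the error, 
the point $\mathbf{w}(1)$. Standard gradient descent from $\mathbf{w}(1)$ would be along the black 
vector, “spoiling” some of the gains made by the first descent; it would, furthermore, 
miss the global minimum. Instead, the conjugate gradient (red vector) does not spoil 
the gains from the first descent, and properly passes through the global error minimum 
at $\mathbf{w} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. 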

6.10 *Additional networks and training methods 


The elementary method of gradient descent used by backpropagation can be slow, 
even with straightforward improvements. We now consider some alternate networks 
and training methods. 


6.10.1 Radial basis function networks (RBF) 


We have already considered several classifiers, such as Parzen windows, that employ 
densities estimated by localized basis functions such as Gaussians. In light of our 
discussion of gradient descent and backpropagation in particular, we now turn to a 
different method for training such networks. A radial basis function network with 
linear output unit implements 


$$z_k(\mathbf{x}) = \sum_{j=0}^{n_H} w_{kj} \phi_j(\mathbf{x}), \tag{56}$$


where we have included a $j = 0$ bias unit. If we define a vector $\boldsymbol{\phi}$ whose components 
are the hidden unit outputs, and a matrix $\mathbf{W}$ whose entries are the hidden-to-output 
weights, then Eq. 56 can be rewritten as: $\mathbf{z}(\mathbf{x}) = \mathbf{W}\boldsymbol{\phi}$. Minimizing the criterion 
function 


$$J(\mathbf{w}) = \frac{1}{2} \sum_{m=1}^{n} \|\mathbf{z}(\mathbf{x}^m; \mathbf{w}) - \mathbf{t}^m\|^2 \tag{57}$$


is formally equivalent to the linear problem we saw in Chap. ??. We let $\mathbf{T}$ be the 
matrix whose rows are the target vectors and $\boldsymbol{\Phi}$ the matrix whose rows are the corresponding vectors 
$\boldsymbol{\phi}^t$, one row per training pattern; then the solution weights obey 


$$\boldsymbol{\Phi}^t \boldsymbol{\Phi} \mathbf{W}^t = \boldsymbol{\Phi}^t \mathbf{T}, \tag{58}$$


and the solution can be written directly: $\mathbf{W}^t = \boldsymbol{\Phi}^{\dagger}\mathbf{T}$. Recall that $\boldsymbol{\Phi}^{\dagger}$ is the 
pseudoinverse of $\boldsymbol{\Phi}$. One of the benefits of such radial basis function or RBF networks 
with linear output units is that the solution requires merely such standard linear 
techniques. Nevertheless, inverting large matrices can be computationally expensive, and 
thus the above method is generally confined to problems of moderate size. 
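
The direct solution for linear output units can be sketched as below. This is a toy illustration under several assumptions that are not from the text: a one-dimensional input, a single output unit, Gaussian hidden units with fixed centers and unit width, and a small hand-written elimination routine that solves the normal equations of Eq. 58 in place of a library pseudoinverse.

```python
import math

def hidden_outputs(x, centers, width=1.0):
    """phi(x): bias unit phi_0 = 1 followed by one Gaussian per (assumed) center."""
    return [1.0] + [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def solve(A, b):
    """Gaussian elimination with partial pivoting for the square system A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_output_weights(xs, ts, centers):
    """Solve Phi^t Phi w = Phi^t t (Eq. 58 for one output unit)."""
    Phi = [hidden_outputs(x, centers) for x in xs]      # one row per pattern
    k = len(centers) + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * t for row, t in zip(Phi, ts)) for i in range(k)]
    return solve(A, b)

centers = [-1.0, 1.0]
true_w = [0.5, 1.0, -2.0]                               # weights used to make toy targets
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ts = [sum(w * p for w, p in zip(true_w, hidden_outputs(x, centers))) for x in xs]

w_fit = fit_output_weights(xs, ts, centers)
print(w_fit)
```

Since the toy targets were generated by a network of the same form, the fitted weights recover the generating ones; no iterative descent is needed.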

If the output units are nonlinear, that is, if the network implements 


$$z_k(\mathbf{x}) = f\left(\sum_{j=0}^{n_H} w_{kj} \phi_j(\mathbf{x})\right) \tag{59}$$


rather than Eq. 56, then standard backpropagation can be used. One need merely 
take derivatives of the localized transfer functions. For classification problems it is 
traditional to use a sigmoid for the output units in order to keep the output values 
restricted to a fixed range. Some of the computational simplification afforded by 
sigmoidal functions at the hidden units is absent, but this presents no conceptual 
difficulties (Problem ??). 


6.10.2 Special bases 


Occasionally we may have special information about the functional form of the dis- 
tributions underlying categories and then it makes sense to use corresponding hidden 
unit transfer functions. In this way, fewer parameters need to be learned for a given 
quality of fit to the data. This is an example of increasing the bias of our model, and 
thereby reducing the variance in the solution, a crucial topic we shall consider again 
in Chap. ??. For instance, if we know that each underlying distribution comes from 
a mixture of two Gaussians, naturally we would use Gaussian transfer functions and 
use a learning rule that sets the parameters (such as the mean and covariance). 


6.10.3 Time delay neural networks (TDNN) 


One can also incorporate prior knowledge into the network architecture itself. For 
instance, if we demand that our classifier be insensitive to translations of the pattern, 
we can effectively replicate the recognizer at all such translations. This is the approach 
taken in time delay neural networks (or TDNNs). 

Figure 6.25 shows a typical TDNN architecture; while the architecture consists 
of input, hidden and output layers, much as we have seen before, there is a crucial 
difference. Each hidden unit accepts input from a restricted (spatial) range of posi- 
tions in the input layer. Hidden units at “delayed” locations (i.e., shifted to the right) 
accept inputs from the input layer that are similarly shifted. Training proceeds as in 
standard backpropagation, but with the added constraint that corresponding weights 
are forced to have the same value — an example of weight sharing. Thus, the weights 
learned do not depend upon the position of the pattern (so long as the full pattern 
lies in the domain of the input layer). 

The feedforward operation of the network (during recognition) is the same as in 
standard three-layer networks, but because of the weight sharing, the final output 
does not depend upon the position of the input. The network gets its name from the 
fact that it was developed for, and finds greatest use in, speech and other temporal 
phenomena, where the shift corresponds to delays in time. Such weight sharing can 
be extended to translations in an orthogonal spatial dimension, and has been used 
in optical character recognition systems, where the location of an image in the input 
space is not precisely known. 


Figure 6.25: A time delay neural network (TDNN) uses weight sharing to ensure that 
patterns are recognized regardless of shift in one dimension; in practice, this dimension 
generally corresponds to time. In this example, there are five input units at each time 
step. Because we hypothesize that the input patterns are of four time steps or less 
in duration, each of the hidden units at a given time step accepts inputs from only 
4 x 5 = 20 input units, as highlighted in gray. An analogous translation constraint is 
also imposed between the hidden and output layer units. 
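
Weight sharing across shifted positions can be sketched in a few lines. The example below is a simplification and not the architecture of Fig. 6.25: it assumes one input value per time step (rather than five), a window of three steps, linear hidden units, and arbitrary weight and pattern values. Every hidden unit applies the same weights to its own window of the input.

```python
def shared_weight_layer(inputs, weights):
    """Linear hidden units; unit i sees inputs[i : i + len(weights)] via the SAME weights."""
    k = len(weights)
    return [sum(w * x for w, x in zip(weights, inputs[i:i + k]))
            for i in range(len(inputs) - k + 1)]

weights = [0.5, -1.0, 2.0]               # one shared set of weights (arbitrary values)
pattern = [1.0, -2.0, 3.0]
early = [0.0] + pattern + [0.0] * 4      # pattern starts at time step 1
late = [0.0] * 4 + pattern + [0.0]       # the same pattern, delayed by 3 steps

h_early = shared_weight_layer(early, weights)
h_late = shared_weight_layer(late, weights)
print(h_early)
print(h_late)
```

The hidden responses to the delayed pattern are the same values delayed by three units, and the largest response is identical in both cases, so a unit that pools over the hidden layer gives an output that does not depend on when the pattern occurred (as long as it lies fully within the input).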


6.10.4 Recurrent networks 


Up to now we have considered only networks which use feedforward flow of information 
during classification; the only feedback flow was of error signals during training. Now 
we turn to feedback or recurrent networks. In their most general form, these have 
found greatest use in time series prediction, but we consider here just one specific 
type of recurrent net that has had some success in static classification tasks. 


Figure 6.26 illustrates such an architecture, one in which the output unit values 
are fed back and duplicated as auxiliary inputs, augmenting the traditional feature 
values. During classification, a static pattern x is presented to the input units, the 
feedforward flow computed, and the outputs fed back as auxiliary inputs. This, in 
turn, leads to a different set of hidden unit activations, new output activations, and 
so on. Ultimately, the activations stabilize, and the final output values are used for 
classification. As such, this recurrent architecture, if “unfolded” in time, is equivalent 
to the static network shown at the right of the figure, where it must be emphasized that 
many sets of weights are constrained to be the same (weight sharing), as indicated. 


This unfolded representation shows that recurrent networks can be trained via 
standard backpropagation, but with the weight sharing constraint imposed, as in 
TDNNs. 

Figure 6.26: The form of recurrent network most useful for static classification has 
the architecture shown at the bottom, with the recurrent connections in red. It is 
functionally equivalent to a static network with many hidden layers and extensive 
weight sharing, as shown above. Note that the input is replicated. 


6.10.5 Counterpropagation 


Occasionally, one wants a rapid prototype of a network, yet one that has expressive 
power greater than a mere two-layer network. Figure 6.27 shows a three-layer net, 
which consists of familiar input, hidden and output layers.* When one is learning the 
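weights for a pattern in category $\omega_i$, the most active hidden unit is determined, and 
only the input-to-hidden weights leading to that unit and the single hidden-to-output 
weight leading to the output unit for $\omega_i$ are modified. 


In this way, the hidden units create a Voronoi tessellation (cf. Chap. ??), and the 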
hidden-to-output weights pool information from such centers of Voronoi cells. The 
processing at the hidden units is competitive learning (Chap. ??). 


The speedup in counterpropagation is that only the weights from the single most 
active hidden unit are adjusted during a pattern presentation. While this can yield 
suboptimal recognition accuracy, counterpropagation can be orders of magnitude 
faster than full backpropagation. As such, it can be useful during preliminary data 
exploration. Finally, the learned weights often provide an excellent starting point for 
refinement by subsequent full training via backpropagation. 


* It is called “counterpropagation” for an earlier implementation that employed five layers with 
signals that passed bottom-up as well as top-down. 

Figure 6.27: The simplest version of a counterpropagation network consists of three 
layers. During training, an input is presented and the most active hidden unit is 
determined. The only weights that are modified are the input-to-hidden weights 
leading to this most active hidden unit and the single hidden-to-output weight leading 
to the proper category. Weights can be trained using an LMS criterion. 


6.10.6 Cascade-Correlation 


The central notion underlying the training of networks by cascade-correlation is quite 
simple. We begin with a two-layer network and train to the minimum of an LMS error. If 
the resulting training error is low enough, training is stopped. In the more common 
case in which the error is not low enough, we fix the weights but add a single 
hidden unit, fully connected from inputs and to output units. Then these new weights 
are trained using an LMS criterion. If the resulting error is not sufficiently low, yet 
another hidden unit is added, fully connected from the input layer and to the output 
layer. Further, the output of each previous hidden unit is multiplied by a fixed weight 
of -1 and presented to the new hidden unit. (This prevents the new hidden unit from 
learning a function already represented by the previous hidden units.) Then the new 
weights are trained via an LMS criterion. Thus training proceeds by alternately 
training weights, then (if needed) adding a new hidden unit, training the new 
modifiable weights, and so on. In this way the network grows to a size that depends upon 
the problem at hand (Fig. 6.28). 

The benefit is that cascade-correlation is often faster than strict backpropagation, 
since fewer weights are updated at any time (Computer exercise 18). 


Algorithm 4 (Cascade-correlation) 


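1 begin initialize a, criterion $\theta$, $\eta$, $k \leftarrow 0$ 
2   do $m \leftarrow m + 1$ 
3     $w_{ki} \leftarrow w_{ki} - \eta \nabla J(\mathbf{w})$ 
4   until $\nabla J(\mathbf{w}) \simeq \theta$ 
5   if $J(\mathbf{w}) > \theta$ then add hidden unit else exit 
6   do $m \leftarrow m + 1$ 
7     $w_{ji} \leftarrow w_{ji} - \eta \nabla J(\mathbf{w})$; $w_{kj} \leftarrow w_{kj} - \eta \nabla J(\mathbf{w})$ 
8   until $\nabla J(\mathbf{w}) \simeq \theta$ 
9 return w 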
10 end 

Figure 6.28: The training of a multilayer network via cascade-correlation begins with 
the input layer fully connected to the output layer (black). Such weights, $w_{ki}$, are 
trained using an LMS criterion, as discussed in Chap. ??. If the resulting training 
error is not sufficiently low, a first hidden unit (labeled 1, in red) is introduced, fully 
interconnected from the input layer and to the output layer. These new red weights are 
trained, while the previous (black) ones are held fixed. If the resulting training error 
is still not sufficiently low, a second hidden unit (labeled 2) is likewise introduced, 
fully interconnected; it also receives the output from each previous hidden unit, 
multiplied by -1. Training proceeds in this way, training successive hidden units until 
the training error is acceptably low. 


6.10.7 Neocognitron 


The cognitron and its descendent, the Neocognitron, address the problem of recogni- 
tion of characters in pixel input. The networks are noteworthy not for the learning 
method, but instead for their reliance on a large number of layers for translation, scale 
and rotation invariance. 

The first layer consists of hand tuned feature detectors, such as vertical, horizon- 
tal and diagonal line detectors. Subsequent layers consist of slightly more complex 
features, such as Ts or Xs, and so forth, formed as weighted groupings of the outputs of 
units at earlier layers. The total number of weights in such a network is enormous 
(Problem 35). 


6.11 Regularization and complexity adjustment 


Whereas the number of inputs and outputs of a backpropagation network are deter- 
mined by the problem itself, we do not know a priori the number of hidden units, 
or weights. If we have too many degrees of freedom, we will have overfitting. This 
will depend upon the number of training patterns and the complexity of the problem 
itself. 

Figure 6.29: The neocognitron consists of a 19 x 19 pixel input layer, seven 
intermediate layers, and an output layer consisting of 10 units, one for each digit. The earlier 
layers consist of relatively fixed feature detectors (as shown); units in each successive 
layer respond to a spatial range of units in the previous layer. In this way, shift, 
rotation and scale invariance is distributed throughout the network. The network is 
trained one layer at a time by a large number of patterns. 

We could try different numbers of hidden units, apply knowledge of the problem 
domain or add other constraints. The error is the sum of an error over patterns (such 
as we have used before) plus a regularization term, which expresses constraints or 
desirable properties of solutions: 


$$J = J_{pat} + \lambda J_{reg}. \tag{60}$$


The parameter $\lambda$ is adjusted to impose the regularization more or less strongly. 
Because a desirable constraint is simpler networks (i.e., simpler models), 
regularization is often used to adjust complexity, as in weight decay. 


6.11.1 Complexity measurement 


XXX 


6.11.2 Wald statistics 


The fundamental theory of generalization favors simplicity. For a given level of per- 
formance on observed data, models with fewer parameters can be expected to perform 
better on test data. For instance weight decay leads to simpler decision boundaries 
(closer to linear). Likewise, training via cascade-correlation adds weights only as 
needed. 

The fundamental idea in Wald statistics is that we can estimate the importance 
of a parameter in a model, such as a weight, by how much the training error increases 
if that parameter is eliminated. To this end the Optimal Brain Damage method 
(OBD) seeks to delete weights while keeping the training error as small as possible. 
Optimal Brain Surgeon (OBS) extended OBD to include the off-diagonal terms in the 
network’s Hessian, which were shown to be significant and important for pruning in 
classical and benchmark problems. 

OBD and OBS share the same basic approach of training 
a network to a (local) minimum in error at weight $\mathbf{w}^*$, and then pruning a weight that 
leads to the smallest increase in the training error. The predicted functional increase 
in the error for a change in full weight vector $\delta\mathbf{w}$ is: 


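$$\delta J = \underbrace{\left(\frac{\partial J}{\partial \mathbf{w}}\right)^t}_{\simeq 0} \cdot \delta\mathbf{w} + \frac{1}{2} \delta\mathbf{w}^t \cdot \underbrace{\frac{\partial^2 J}{\partial \mathbf{w}^2}}_{\equiv \mathbf{H}} \cdot \delta\mathbf{w} + O(\|\delta\mathbf{w}\|^3), \tag{61}$$


where $\mathbf{H}$ is the Hessian matrix. The first term vanishes because we are at a local 
minimum in error; we ignore third- and higher-order terms. The general solution for 
minimizing this function given the constraint of deleting one weight is (Problem ??): 


$$\delta\mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1} \cdot \mathbf{u}_q \quad \text{and} \quad L_q = \frac{1}{2} \frac{w_q^2}{[\mathbf{H}^{-1}]_{qq}}. \tag{62}$$

Here, $\mathbf{u}_q$ is the unit vector along the $q$th direction in weight space and $L_q$ is the 
saliency of weight $q$, an estimate of the increase in training error if weight $q$ is 
pruned and the other weights updated by the left equation in Eq. 62 (Problem 42). 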
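
Equation 62 can be exercised on a two-weight toy problem. The sketch below is illustrative only: the Hessian and weight values are made up, the helper names are hypothetical, and the inverse is computed directly for the 2-by-2 case. It picks the weight with the smallest saliency, applies the compensating update to the remaining weight, and checks the predicted error increase against the quadratic form of Eq. 61.

```python
def inverse_2x2(H):
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def obs_prune(w, H):
    """Return (q, L_q, new weights) for the lowest-saliency weight, per Eq. 62."""
    Hinv = inverse_2x2(H)
    saliency = [0.5 * w[q] ** 2 / Hinv[q][q] for q in range(2)]
    q = min(range(2), key=lambda i: saliency[i])
    scale = -w[q] / Hinv[q][q]
    dw = [scale * Hinv[i][q] for i in range(2)]        # -(w_q / [H^-1]_qq) H^-1 u_q
    return q, saliency[q], [w[i] + dw[i] for i in range(2)], dw

H = [[2.0, 1.0], [1.0, 3.0]]       # assumed Hessian at the trained minimum
w = [1.0, 0.2]                     # assumed trained weights

q, L_q, w_pruned, dw = obs_prune(w, H)

# Second-order error increase 1/2 dw^t H dw, for comparison with the saliency.
H_dw = [H[0][0] * dw[0] + H[0][1] * dw[1], H[1][0] * dw[0] + H[1][1] * dw[1]]
increase = 0.5 * (dw[0] * H_dw[0] + dw[1] * H_dw[1])
print(q, L_q, w_pruned, increase)
```

The pruned weight ends at zero, the surviving weight is adjusted to compensate, and the quadratic error increase equals the saliency. A diagonal-only (OBD-style) estimate for the same weight would be $\frac{1}{2} H_{qq} w_q^2 = 0.06$, slightly larger than the 0.05 obtained with the full Hessian here.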


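We define $\mathbf{X}_k \equiv \partial g(\mathbf{x}^k; \mathbf{w})/\partial \mathbf{w}$ and $a_k \equiv \partial^2 d(t_k, z_k)/\partial z^2$, and can easily show that the 
recursion for computing the inverse Hessian becomes: 


$$\mathbf{H}^{-1}_{m+1} = \mathbf{H}^{-1}_m - \frac{\mathbf{H}^{-1}_m \mathbf{X}_{m+1} \mathbf{X}^t_{m+1} \mathbf{H}^{-1}_m}{\frac{n}{a_{m+1}} + \mathbf{X}^t_{m+1} \mathbf{H}^{-1}_m \mathbf{X}_{m+1}}, \tag{63}$$

$$\mathbf{H}^{-1}_0 = \alpha^{-1}\mathbf{I}, \qquad \mathbf{H}^{-1}_n = \mathbf{H}^{-1}, \tag{64}$$


where $\alpha$ is a small parameter, effectively a weight decay constant (Problem 38). 
Note how different error measures $d(t,z)$ scale the gradient vectors $\mathbf{X}_k$ forming the 
Hessian (Eq. ??). For the squared error $d(t,z) = \frac{1}{2}(t - z)^2$, we have $a_k = 1$, and all 
gradient vectors are weighted equally. 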

Problem: repeat for cross-entropy (Problem 36). 

Figure 6.30: The saliency of a parameter, such as a weight, is the increase in the 
training error when that weight is set to zero. One can approximate the saliency by 
expanding the true error around a local minimum, $\mathbf{w}^*$, and setting the weight to zero. 
In this example the approximated saliency is smaller than the true saliency; this is 
typically, but not always, the case. 


Figure 6.31: In the second-order approximation to the criterion function, optimal 
brain damage assumes the Hessian matrix is diagonal, while Optimal Brain Surgeon 
uses the full Hessian matrix. 


Summary 


Multilayer nonlinear neural networks — nets with two or more layers of modifiable 
weights — trained by gradient descent methods such as backpropagation perform a 
maximum likelihood estimation of the weight values (parameters) in the model defined 
by the network topology. One of the great benefits of learning in such networks is the 
simplicity of the learning algorithm, the ease in model selection, and the incorporation 
of heuristic constraints by means such as weight decay. Discrete pruning algorithms 
such as Optimal Brain Surgeon and Optimal Brain Damage correspond to priors 
favoring few weights, and can help avoid overfitting. 




Alternative networks and training algorithms have their own benefits. For instance, radial basis
functions are most useful when the data cluster. Cascade-correlation and counterpropagation
are generally faster than backpropagation.

Network complexity can be adjusted by weight decay or by pruning based on the Wald statistic,
which for networks takes the form of Optimal Brain Damage and Optimal Brain Surgeon; both
use a second-order approximation to the true saliency as a pruning criterion.
Bibliographical and Historical Remarks


McCulloch and Pitts provided the first principled mathematical and logical treatment 
of the behavior of networks of simple neurons [49]. This pioneering work addressed 
non-recurrent as well as recurrent nets (those possessing “circles,” in their termi- 
nology), but not learning. Its concentration on all-or-none or threshold function of 
neurons indirectly delayed the consideration of continuous valued neurons that would 
later dominate the field. These authors later wrote an extremely important paper 
on featural mapping (cf. Chap. ??), invariances, and learning in nervous systems and 
thereby advanced the conceptual development of pattern recognition significantly [56]. 

Rosenblatt’s work on the (two-layer) Perceptron (cf. Chap. ??) [61, 62] was some 
of the earliest to address learning, and was the first to include rigorous proofs about 
convergence. A number of stochastic methods, including Pandemonium [66, 67], were 
developed for training networks with several layers of processors, though in keeping 
with the preoccupation with threshold functions, such processors generally computed 
logical functions (AND or OR), rather than some continuous functions favored in later 
neural network research. The limitations of networks implementing linear discrimi- 
nants — linear machines — were well known in the 1950s and 1960s and discussed by 
both their promoters [62, cf., Chapter xx, “Summary of Three-Layer Series-Coupled 
Systems: Capabilities and Deficiencies”] and their detractors [51, cf., Chapter 5, 
“CONNECTED: A Geometric Property with Unbounded Order”]. 

A popular early method was to design by hand three-layer networks with fixed 
input-to-hidden weights, and then train the hidden-to-output weight [80, for a review]. 
Much of the difficulty in finding learning algorithms for all layers in a multilayer neural 
network came from the prevalent use of linear threshold units. Since these do not have 
useful derivatives throughout their entire range, the current approach of applying the 
chain rule for derivatives and the resulting “backpropagation of errors” did not gain 
more adherents earlier. 

The development of backpropagation was gradual, with several steps, not all of 
which were appreciated or used at the time. The earliest application of adaptive 
methods that would ultimately become backpropagation came from the field of con- 
trol. Kalman filtering from electrical engineering [38, 28] used an analog error (dif- 
ference between predicted and measured output) for adjusting gain parameters in 
predictors. Bryson, Denham and Dreyfus showed how Lagrangian methods could 
train multilayer networks for control, as described in [6]. We saw in the last chapter 
the work of Widrow, Hoff and their colleagues [81, 82] in using analog signals and 
the LMS training criterion applied to pattern recognition in two-layer networks. Wer- 
bos [77][78, Chapter 2], too, discussed a method for calculating the derivatives of a 
function based on a sequence of samples (as in a time series), which, if interpreted 
carefully, carried the key ideas of backpropagation. Parker’s early “Learning logic”
[53, 54], developed independently, showed how layers of linear units could be learned
from a sufficient number of input-output pairs. This work lacked simulations on representative
or challenging problems (such as XOR) and was not appreciated adequately.
Le Cun independently developed a learning algorithm for three-layer networks [9, in 
French] in which target values are propagated, rather than derivatives; the resulting 
learning algorithm is equivalent to standard backpropagation, as pointed out shortly 
thereafter [10]. 

Without question, the paper by Rumelhart, Hinton and Williams [64], later ex- 
panded into a full and readable chapter [65], brought the backpropagation method to 
the attention of the widest audience. These authors clearly appreciated the power of 
the method, demonstrated it on key tasks (such as the exclusive OR), and applied it
to pattern recognition more generally. An enormous number of papers and books of 
applications — from speech production and perception, optical character recognition, 
data mining, finance, game playing and much more — continues unabated. One novel 
class of applications for such networks includes generalization for production [20, 21]. One view of
the history of backpropagation is [78]; two collections of key papers in the history of 
neural processing more generally, including many in pattern recognition, are [3, 2]. 


Clear elementary papers on neural networks can be found in [46, 36], and several 
good textbooks, which differ from the current one in their emphasis on neural networks 
over other pattern recognition techniques, can be recommended [4, 60, 29, 27]. An 
extensive treatment of the mathematical aspects of networks, much of which is beyond 
that needed for mastering the use of networks for pattern classification, can be found 
in [19]. There is continued exploration of the strong links between networks and more 
standard statistical methods; White presents an overview [79], and books such as
[8, 68] explore a number of close relationships. The important relation of multilayer 
Perceptrons to Bayesian methods and probability estimation can be found in [23, 
59, 43, 5, 13, 63, 52]. Original papers on
projection pursuit and MARS can be found in [15] and [34], respectively, and a good
overview in [60]. 


Shortly after its wide dissemination, the backpropagation algorithm was criti- 
cized for its lack of biological plausibility; in particular, Grossberg [22] discussed the 
non-local nature of the algorithm, i.e., that synaptic weight values were transported 
without physical means. Somewhat later Stork devised a local implementation of 
backpropagation [71, 45], and pointed out that it was nevertheless highly implausible
as a biological model.


The discussions and debates over the relevance of Kolmogorov’s Theorem [39] to 
neural networks, e.g. [18, 40, 41, 33, 37, 12, 42], have centered on the expressive 
power. The proof of the universal expressive power of three-layer nets based on
bumps and Fourier ideas appears in [31]. The expressive power of networks having 
non-traditional transfer functions was explored in [72, 73] and elsewhere. The fact 
that three-layer networks can have local minima in the criterion function was explored 
in [50] and some of the properties of error surfaces illustrated in [35]. 


The Levenberg-Marquardt approximation and deeper analysis of second-order 
methods can be found in [44, 48, 58, 24]. Three-layer networks trained via cascade- 
correlation have been shown to perform well compared to standard three-layer nets 
trained via backpropagation [14]. Our presentation of counterpropagation networks 
focused on just three of the five layers in a full such network [30]. Although there
was little new from a learning-theory standpoint in Fukushima’s Neocognitron [16, 17],
its use of many layers and mixture of hand-crafted feature detectors and learning 
groupings showed how networks could address shift, rotation and scale invariance. 


The simple method of weight decay was introduced in [32], and gained greater acceptance
due to the work of Weigend and others [76]. The method of hints was introduced
in [1]. While the Wald test [74, 75] has been used in traditional statistical research
[69], its application to multilayer network pruning began with Le Cun
et al.'s Optimal Brain Damage method [11], later extended to include non-diagonal
Hessian matrices [24, 25, 26], including some speedup methods [70]. A good review 
of the computation and use of second order derivatives in networks can be found in 
[7] and of pruning algorithms in [58]. 
Problems


Section 6.2


1. Show that if the transfer function of the hidden units is linear, a three-layer 
network is equivalent to a two-layer one. Explain why, therefore, a three-layer
network with linear hidden units cannot solve a non-linearly separable problem such 
as XOR or n-bit parity. 
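
The collapse of linear hidden layers can be checked numerically. The sketch below is a minimal illustration, with arbitrary hypothetical weights and with bias terms omitted for brevity (a bias can be absorbed by augmenting the input with a constant component): composing the two weight matrices yields a single matrix that reproduces the net activation at the output unit for every input.

```python
def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical weights: 2 inputs -> 3 linear hidden units -> 1 output unit.
W_HIDDEN = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
W_OUT = [[1.0, -2.0, 0.5]]

# A single equivalent weight matrix for a two-layer network.
W_EQUIV = matmul(W_OUT, W_HIDDEN)


def three_layer_net(x):
    """Net activation at the output when the hidden units are linear."""
    return matvec(W_OUT, matvec(W_HIDDEN, x))


def two_layer_net(x):
    """Net activation at the output of the equivalent two-layer network."""
    return matvec(W_EQUIV, x)
```

Because the two networks agree on every input, any decision boundary of the three-layer network with linear hidden units is also a decision boundary of a linear machine.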

2. Fourier’s Theorem can be used to show that a three-layer neural net with sigmoidal 
hidden units can approximate to arbitrary accuracy any posterior function. Consider 
two-dimensional input and a single output, $z(x_1, x_2)$. Recall that Fourier’s Theorem
states that, given weak restrictions, any such function can be written as a possibly
infinite sum of cosine functions, as


$$z(x_1, x_2) = \sum_{f_1} \sum_{f_2} A_{f_1 f_2} \cos(f_1 x_1) \cos(f_2 x_2),$$


with coefficients $A_{f_1 f_2}$.


(a) Use the trigonometric identity

$$\cos\alpha \cos\beta = \frac{1}{2}\cos(\alpha + \beta) + \frac{1}{2}\cos(\alpha - \beta)$$

to write $z(x_1, x_2)$ as a linear combination of terms $\cos(f_1 x_1 + f_2 x_2)$ and
$\cos(f_1 x_1 - f_2 x_2)$.


(b) Show that $\cos(x)$, or indeed any continuous function $f(x)$, can be approximated
to any accuracy by a linear combination of sign functions as:

$$f(x) \approx f(x_0) + \sum_{i=0}^{N} \left[f(x_{i+1}) - f(x_i)\right] \left[\frac{1 + \mathrm{Sgn}(x - x_i)}{2}\right],$$

where the $x_i$ are sequential values of $x$; the smaller $x_{i+1} - x_i$, the better the
approximation.


(c) Put your results together to show that $z(x_1, x_2)$ can be expressed as a linear
combination of step functions or sign functions whose arguments are themselves
linear combinations of the input variables $x_1$ and $x_2$. Explain, in turn, why
this implies that a three-layer network with sigmoidal hidden units and a linear 
output unit can implement any function that can be expressed by a Fourier 
series. 


(d) Does your construction guarantee that the derivative df (x)/dx can be well ap- 
proximated too? 
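
The sign-function approximation in part (b) is easy to explore numerically. The following sketch is an illustration only: it assumes an evenly spaced grid of the x_i (the problem does not require even spacing) and measures the worst-case error at sample points; the function names are hypothetical.

```python
import math


def sgn(v):
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def staircase(f, x, grid):
    """f(x_0) + sum_i [f(x_{i+1}) - f(x_i)] * (1 + Sgn(x - x_i)) / 2."""
    total = f(grid[0])
    for i in range(len(grid) - 1):
        total += (f(grid[i + 1]) - f(grid[i])) * (1 + sgn(x - grid[i])) / 2
    return total


def max_error(f, lo, hi, n_steps, n_probe=200):
    """Largest absolute error of the staircase at n_probe interior points."""
    h = (hi - lo) / n_steps
    grid = [lo + i * h for i in range(n_steps + 1)]
    probes = [lo + (hi - lo) * (k + 0.5) / n_probe for k in range(n_probe)]
    return max(abs(f(x) - staircase(f, x, grid)) for x in probes)


coarse = max_error(math.cos, 0.0, 2 * math.pi, 10)
fine = max_error(math.cos, 0.0, 2 * math.pi, 1000)
```

Between grid points the sum telescopes to the value of f at the next grid point, so for the cosine (whose slope never exceeds 1 in magnitude) the error is bounded by the grid spacing. The staircase itself is piecewise constant, which is relevant to part (d).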
3. Consider a $d - n_H - c$ network trained with $n$ patterns for $m_e$ epochs.


(a) What is the space complexity in this problem? (Consider both the storage of 
network parameters as well as the storage of patterns, but not the program 
itself.) 
(b) Suppose the network is trained in stochastic mode. What is the time complex-
ity? Since this is dominated by the number of multiply-accumulations, use this 
as a measure of the time complexity. 


(c) Suppose the network is trained in batch mode. What is the time complexity? 


4. Prove that the formula for the sensitivity $\delta$ for a hidden unit in a three-layer net
(Eq. 20) generalizes to a hidden unit in a four- (or higher-) layer network, where the 
sensitivity is the weighted sum of sensitivities of units in the next higher layer. 

5. Explain in words why the backpropagation rule for training input-to-hidden 
weights makes intuitive sense by considering the dependency upon each of the terms 
in Eq. 20. 

6. One might reason that the dependence of the backpropagation learning rules
(Eq. ??) should be roughly inversely related to f’ (net); i.e., that weight change should 
be large where the output does not vary. In fact, of course, the learning rule is linear 
in f’(net). What, therefore, is wrong with the above view? 

7. Show that the learning rule described in Eqs. 16 & 20 works for bias, where 
$x_0 = y_0 = 1$ is treated as another input and hidden unit.

8. Consider a standard three-layer backpropagation net with $d$ input units, $n_H$
hidden units, c output units, and bias. 


(a) How many weights are in the net? 


(b) Consider the symmetry in the value of the weights. In particular, show that if 
the sign is flipped on every weight, the network function is unaltered.


(c) Consider now the hidden unit exchange symmetry. There are no labels on 
the hidden units, and thus they can be exchanged (along with corresponding 
weights) and leave network function unaffected. Prove that the number of such 
equivalent labellings (the exchange symmetry factor) is thus $n_H!\,2^{n_H}$. Evaluate
this factor for the case $n_H = 10$.


9. Using the style of procedure, write the procedure for the on-line version of backpropagation
training, being careful to distinguish it from stochastic and batch procedures.

10. Express the derivative of a sigmoid in terms of the sigmoid itself in the following 
two cases (for positive constants a and b): 


(a) A sigmoid that is purely positive: $f(net) = \frac{1}{1 + e^{a\,net}}$


(b) An anti-symmetric sigmoid: f(net) = atanh(b net). 


11. Generalize the backpropagation to four layers, and individual (smooth, differentiable)
transfer functions at each unit. In particular, let $x_i$, $y_j$, $v_l$ and $z_k$ denote
the activations on units in successive layers of a four-layer fully connected network,
trained with target values $t_k$. Let $f_{1i}$ be the transfer function of unit $i$ in the first
layer, $f_{2j}$ in the second layer, and so on. Write a program, with greater detail than
that of Algorithm 1, showing the calculation of sensitivities, weight update, etc. for 
the general four-layer network. 
12. Use Eq. ?? to show why the input-to-hidden weights must be different from each
other (e.g., random) or else learning cannot proceed well (cf. Computer Exercise 2). 
13. Show that proper preprocessing of the data will lead to faster convergence, at
least in a simple 2-1 (two-layer) network with bias. Suppose the training
data come from two Gaussians, $p(x|\omega_1) \sim N(-0.5, 1)$ and $p(x|\omega_2) \sim N(+0.5, 1)$. Let
the teaching values for the two categories be $t = \pm 1$.


(a) Write the error as a sum over the n patterns of a function of the weights, inputs, 
etc. 


(b) Differentiate twice with respect to the weights to get the Hessian H. Express 
your answer in words as well. 


(c) Consider two data sets drawn from $p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \mathbf{I})$ for $i = 1, 2$, where $\mathbf{I}$ is the
2 x 2 identity matrix. Calculate your Hessian in terms of $\boldsymbol{\mu}_i$.


(d) Calculate the maximum and minimum eigenvalues of the Hessian in terms of 
the components of $\boldsymbol{\mu}_i$.


(e) Suppose $\boldsymbol{\mu}_1 = (1, 0)^t$ and $\boldsymbol{\mu}_2 = (0, 1)^t$. Calculate the ratio of the eigenvalues,
and hence a measure of the convergence time. 


(f) Now standardize your data, by subtracting means and scaling to have unit
covariances in each of the two dimensions. That is, find two new distributions 
that have overall zero mean and the same covariance. Check your answer by 
calculating the ratio of the maximum to minimum eigenvalues. 


(g) If T denotes the total training time in the unprocessed data, express the time 
required for the preprocessed data (cf. Computer exercise 13). 


14. Consider the derivation of the bounds on the convergence time for gradient 
descent. Complete the steps leading to Eq. ?? as follows: 


(a) Express the error to second order in new coordinates w that are parallel to the 
principal axes of the Hessian. 


(b) Write an equation analogous to that of Eq. ?? in the transformed space. Use $\Lambda$
as the diagonal matrix of eigenvalues of the Hessian. 


(c) Inspect your result and use Eq. ?? to state a criterion for convergence in terms 
of $\lambda_{max}$, the maximum eigenvalue of the Hessian.


15. Assume that the criterion function J(w) is well described to second order by a 
Hessian matrix H. 


(a) Show that convergence of learning is assured if the learning rate obeys $\eta <
2/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of H.


(b) Show that the learning time is thus dependent upon the ratio of the largest to 
the smallest non-negligible eigenvalue of H. 


(c) Explain why “standardizing” the training data can therefore reduce learning 
time. 
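
The bound in part (a) can be seen in a small numerical experiment. The sketch below is an illustration under simplifying assumptions: the criterion is taken to be exactly quadratic and expressed in coordinates along the principal axes of the Hessian, so each weight component evolves independently as w_i ← w_i (1 − η λ_i). The function name and the eigenvalues are hypothetical.

```python
def descend(eigenvalues, eta, steps=200):
    """Gradient descent on J(w) = 0.5 * sum_i lambda_i * w_i**2, from w = (1, ..., 1).

    Returns the largest remaining weight magnitude after the given number of steps.
    """
    w = [1.0] * len(eigenvalues)
    for _ in range(steps):
        # The gradient along principal axis i is lambda_i * w_i.
        w = [wi - eta * lam * wi for wi, lam in zip(w, eigenvalues)]
    return max(abs(wi) for wi in w)


EIGENVALUES = [1.0, 10.0]  # hypothetical Hessian spectrum, lambda_max = 10

just_below = descend(EIGENVALUES, eta=0.19)  # eta < 2 / lambda_max = 0.2
just_above = descend(EIGENVALUES, eta=0.21)  # eta > 2 / lambda_max
```

With the learning rate just below 2/λ_max every component shrinks at each step; just above it, the component along the largest eigenvalue grows without bound. Since the learning rate is limited by λ_max, the component along the smallest eigenvalue decays slowly, which is why the ratio of the eigenvalues governs the learning time in part (b).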
16. Problem on feature mapping. xx


Section 6.6


17. Fill in the steps in the derivation leading to Eq. 25. 

18. Consider Eq. 27, and confirm that one of the solutions to the minimum squared 
error condition yields outputs that are indeed posterior probabilities. Do this as 
follows: 


(a) To find the minimum of $J(\mathbf{w})$, calculate its derivative $\partial J(\mathbf{w})/\partial \mathbf{w}$; this will
consist of the sum of two integrals. Set $\partial J(\mathbf{w})/\partial \mathbf{w} = 0$ and solve to obtain the
natural solution. 


(b) Apply Bayes’ rule and the normalization $P(\omega_k|\mathbf{x}) + P(\omega_{i \neq k}|\mathbf{x}) = 1$ to prove
that the outputs $z_k = g_k(\mathbf{x}; \mathbf{w})$ are indeed equal to the posterior probabilities
$P(\omega_k|\mathbf{x})$.


19. In the derivation that backpropagation finds a least squares fit to the posterior 
probabilities, it was implicitly assumed that the network could indeed represent the 
true underlying distribution. Explain where in the derivation this was assumed, and 
what in the subsequent steps may not hold if that assumption is violated. 

20. Show that the softmax output (Eq. 29) indeed approximates posterior probabil- 
ities if the hidden unit outputs, y, belong to the family of exponential distributions 
as: 


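$$p(\mathbf{y}|\omega_k) = \exp[A(\tilde{\mathbf{w}}_k) + B(\mathbf{y}, \phi) + \tilde{\mathbf{w}}_k^t \mathbf{y}]$$


for $n_H$-dimensional vectors $\tilde{\mathbf{w}}_k$ and $\mathbf{y}$, and scalar $\phi$ and scalar functions $A(\cdot)$ and
$B(\cdot, \cdot)$. Proceed as follows:


(a) Given $p(\mathbf{y}|\omega_k)$, use Bayes’ Theorem to write the posterior probability $P(\omega_k|\mathbf{y})$.
(b) Interpret the parameters $A(\cdot)$, $\tilde{\mathbf{w}}_k$, $B(\cdot, \cdot)$ and $\phi$ in light of your results.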


21. Consider a three-layer network for classification with output units employing 
softmax (Eq. 29), trained with 0–1 signals.


(a) Derive the learning rule if the criterion function (per pattern) is sum squared
error, i.e.,

$$J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2.$$


(b) Repeat for the case in which the criterion function is cross-entropy, i.e.,
22. Clearly if the discriminant functions $g_{k_1}(\mathbf{x}; \mathbf{w})$ and $g_{k_2}(\mathbf{x}; \mathbf{w})$ were independent,
the derivation of Eq. 26 would follow from Eq. 27. Show that the derivation is never- 
theless valid despite the fact that these functions are implemented in part using the 
same input-to-hidden weights. 
23. Show that the slope of the sigmoid and the learning rates together determine
the learning time. 


(a) That is, show that if the slope of the sigmoid is increased by a factor of $\gamma$, and
the learning rate decreased by a factor $1/\gamma$, the total learning time remains
the same.


(b) Must the input be rescaled for this relationship to hold? 


24. Show that the basic three-layer neural networks of Sect. 6.2 are special cases of 
general additive models by describing in detail the correspondences between Eqs. 6 &
31. 

25. Show that the sigmoidal transfer function acts to transmit the maximum infor- 
mation if its inputs are distributed normally. Recall that the entropy (a measure of 
information) is defined as $H = -\int p(y) \ln[p(y)]\,dy$.


(a) Consider a continuous input variable $x$ drawn from the density $p(x) \sim N(0, \sigma^2)$.
What is entropy for this distribution? 


(b) Suppose samples x are passed through an antisymmetric sigmoidal function to 
give y = f(x), where the zero crossing of the sigmoid occurs at the peak of the 
Gaussian input, and the effective width of the linear region is equal to the range
$-\sigma < x < +\sigma$. What values of $a$ and $b$ in Eq. 33 ensure this?


(c) Calculate the entropy of the output distribution p(y). 


(d) Suppose instead that the transfer function were a Dirac delta function $\delta(x - 0)$.
What is the entropy of the resulting output distribution p(y)? 


(e) Summarize your results of (c) and (d) in words. 
26. Consider the sigmoidal transfer function:


$$f(net) = a \tanh(b\,net) = a\left[\frac{1 - e^{-2b\,net}}{1 + e^{-2b\,net}}\right] = \frac{2a}{1 + e^{-2b\,net}} - a.$$


(a) Show that its derivative f'(net) can be written simply in terms of f (net) itself. 
(b) What are $f(net)$, $f'(net)$ and $f''(net)$ at $net = -\infty$? $0$? $+\infty$?
(c) For which value of net is the second derivative f”(net) extremal? 
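
A candidate answer to part (a) can be checked numerically before attempting the algebra. The sketch below is an illustration only: it compares the closed form f'(net) = (b/a)[a² − f²(net)], which follows from the derivative of the hyperbolic tangent, against a central finite difference. The constants a = 1.716 and b = 2/3 are the values quoted in the computer exercises; the function names are hypothetical.

```python
import math

A = 1.716      # sigmoid amplitude a
B = 2.0 / 3.0  # sigmoid slope parameter b


def f(net):
    """Anti-symmetric sigmoid f(net) = a * tanh(b * net)."""
    return A * math.tanh(B * net)


def f_prime_from_f(net):
    """Derivative written in terms of f itself: (b/a) * (a^2 - f^2)."""
    return (B / A) * (A * A - f(net) ** 2)


def f_prime_numeric(net, h=1e-6):
    """Central finite-difference estimate of the derivative."""
    return (f(net + h) - f(net - h)) / (2.0 * h)
```

Writing the derivative in terms of f is convenient in practice, because a backpropagation program has already computed f(net) during the forward pass.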


27. Consider the computational burden for standardizing data, as described in the 
text. 
(a) What is the computational complexity of standardizing a training set of n d-
dimensional patterns? 


(b) Estimate the computational complexity of training. Use the heuristic for choos- 
ing the size of the network (i.e., number of weights) described in Sect. 6.8.7. 
Assume that the number of training epochs is nd. 


(c) Use your results from (a) and (b) to express the computational burden of stan- 
dardizing as a ratio. (Assume unknown constants are 1.) 


28. Derive the gradient descent learning rule for a three-layer network with linear in- 
put units and sigmoidal hidden and output units for the Minkowski xxx and arbitrary 
R. Confirm that your answer reduces to Eqs. 16 & 20 for R = 2. 

29. Training rule for polynomial classifier. Show that terms can become extremely 
large for realistic values of the input. 

30. Train in noise. show improves bp under realistic assumptions; not so nearest 
neighbor 
31. Derive the exact expression for the full Hessian matrix for a sum squared error
criterion in a three-layer network, as given in Eqs. 44–46.

32. Repeat Problem 31 but for a cross entropy error criterion. 

33. Calculate a Hessian, see if it shrinks any vector. (Convergence assured.) 

34. Derive Eq. 51 from the discussion in the text. 
35. What is the space complexity of the Neocognitron network of Fig. 6.29? If
we used the heuristic of Sec. 6.8.7, how many training patterns would be needed? 
(In practice, since many weights are hand set in the form of feature detectors, fewer 
training patterns are needed.) 

36. Derive the central equations for OBD and OBS in a three-layer sigmoidal network 
for a cross-entropy error. 


Section 6.11
37. Consider a general constant matrix K and variable vector parameter x. 
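(a) Write in summation notation with components explicit, and derive the formula
for the derivative:

$$\frac{d}{d\mathbf{x}}[\mathbf{x}^t \mathbf{K} \mathbf{x}] = (\mathbf{K} + \mathbf{K}^t)\mathbf{x}.$$


(b) Show simply that for the case where $\mathbf{K}$ is symmetric (as for instance the Hessian
matrix $\mathbf{H} = \mathbf{H}^t$), we have:

$$\frac{d}{d\mathbf{x}}[\mathbf{x}^t \mathbf{H} \mathbf{x}] = 2\mathbf{H}\mathbf{x}$$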


as was used in Eq. ?? and in the derivation of the Optimal Brain Surgeon 
method. 
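
The derivative formula can be verified numerically for a small non-symmetric matrix. This is a minimal sketch with hypothetical names and an arbitrary test matrix; it compares the analytic gradient (K + K^t)x with a central finite difference of the quadratic form.

```python
def quad_form(K, x):
    """The scalar x^t K x."""
    n = len(x)
    return sum(x[i] * K[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_gradient(K, x):
    """(K + K^t) x, written out component by component."""
    n = len(x)
    return [sum((K[i][j] + K[j][i]) * x[j] for j in range(n)) for i in range(n)]


def numeric_gradient(K, x, h=1e-6):
    """Central finite-difference gradient of x^t K x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((quad_form(K, up) - quad_form(K, down)) / (2.0 * h))
    return grad


K_EXAMPLE = [[1.0, 2.0], [3.0, 4.0]]  # deliberately not symmetric
X_EXAMPLE = [0.5, -1.5]
```

When K is symmetric the two terms K[i][j] and K[j][i] coincide, and the gradient reduces to 2Kx, as in part (b).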
38. Show that the constant $\alpha$ in the OBS derivation (Eq. ??) is equivalent to a
weight decay. 
39. 


(a) Find the space and the time computational complexities for one step in the 
nominal OBS method. 


(b) Find the space and the time computational complexities for pruning the first 
weight in OBS. What is it for pruning subsequent weights, if one uses Schur’s
decomposition method? 


(c) Find the space and the time computational complexities for one step of OBD 
(without retraining). 


40. Weight decay is equivalent to doing gradient descent on an error that has a 
“complexity” term. 


(a) Show that the weight decay rule $w_{ij}^{new} = w_{ij}^{old}(1 - \epsilon)$ amounts to performing
gradient descent in the error function $J_{ef} = J(\mathbf{w}) + \gamma \mathbf{w}^t \mathbf{w}$ (Eq. 38).


(b) Express $\gamma$ in terms of the weight decay constant $\epsilon$ and learning rate $\eta$.


(c) Likewise, show that if $w_{mr}^{new} = w_{mr}^{old}(1 - \epsilon_{mr})$ where $\epsilon_{mr} = 1/(1 + w_{mr}^2)^2$, then
the new effective error function is $J_{ef} = J(\mathbf{w}) + \gamma \sum_{mr} w_{mr}^2/(1 + w_{mr}^2)$. Find $\gamma$ in
terms of $\eta$ and $\epsilon_{mr}$.


(d) Consider a network with a wide range of magnitudes for weights. Describe 
qualitatively how the two different weight decay methods affect the network. 
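
The equivalence in parts (a) and (b) can be illustrated numerically. The sketch below is an assumption-laden illustration, not a solution from the text: it takes the penalty to be γ w^t w, whose gradient is 2γw, so that one gradient step on the penalty alone multiplies each weight by (1 − 2ηγ). Matching this to the decay factor (1 − ε) suggests γ = ε/(2η). The function names and numerical values are hypothetical.

```python
def decay_step(w, eps):
    """Simple weight decay: w_new = w_old * (1 - eps)."""
    return [wi * (1.0 - eps) for wi in w]


def penalty_gradient_step(w, eta, gamma):
    """One gradient descent step on the penalty gamma * w^t w alone.

    The gradient of gamma * w^t w with respect to w is 2 * gamma * w.
    """
    return [wi - eta * 2.0 * gamma * wi for wi in w]


ETA = 0.1                  # learning rate (hypothetical value)
EPS = 0.01                 # weight decay constant (hypothetical value)
GAMMA = EPS / (2.0 * ETA)  # penalty strength that matches the decay

WEIGHTS = [0.8, -1.5, 0.02, 3.0]
decayed = decay_step(WEIGHTS, EPS)
penalized = penalty_gradient_step(WEIGHTS, ETA, GAMMA)
```

Both updates shrink every weight by the same fraction regardless of its magnitude, which is a useful point of comparison for the magnitude-dependent rule in part (c).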


41. Show that the weight decay rule of Eq. 37 is equivalent to a prior on models 
that favors small weights. 


Section ??
42. 
(a) Fill in the steps between Eq. ?? and ?? for the saliency. 


(b) Find the saliency in OBD, where one assumes $H_{ij} = 0$ for $i \neq j$.
43. Prove that the weight decay rule of Eq. 37 leads to the $J_{reg}$ of Eq. 38.


Computer exercises 


Several exercises will make use of the following three-dimensional data sampled from 
three categories, denoted $\omega_i$. (CHANGE NUMBERS xxx)

              ω1                     ω2                     ω3
sample    x1     x2     x3   |   x1     x2     x3   |   x1     x2     x3
   1     0.28   1.31  -6.2   |  0.011  1.03  -0.21  |  1.36   2.17   0.14
   2     0.07   0.58  -0.78  |  1.27   1.28   0.08  |  1.41   1.45  -0.38
   3     1.54   2.01  -1.63  |  0.13   3.12   0.16  |  1.22   0.99   0.69
   4    -0.44   1.18  -4.32  | -0.21   1.23  -0.11  |  2.46   2.19   1.31
   5    -0.81   0.21   5.73  | -2.18   1.39  -0.19  |  0.68   0.79   0.87
   6     1.52   3.16   2.77  |  0.34   1.96  -0.16  |  2.51   3.22   1.35
   7     2.20   2.42  -0.19  | -1.38   0.94   0.45  |  0.60   2.44   0.92
   8     0.91   1.94   6.21  | -0.12   0.82   0.17  |  0.64   0.13   0.97
   9     0.65   1.93   4.38  | -1.44   2.31   0.14  |  0.85   0.58   0.99
  10    -0.26   0.82  -0.96  |  0.26   1.94   0.08  |  0.66   0.51   0.88
1. Consider a 2-2-1 network with bias, where the transfer function at the hidden
units and the output unit is a sigmoid $y_j = a \tanh[b\,net_j]$ for $a = 1.716$ and $b = 2/3$.
Suppose the matrices describing the input-to-hidden weights ($w_{ji}$ for $j = 1, 2$ and
$i = 0, 1, 2$) and the hidden-to-output weights ($w_{kj}$ for $k = 1$ and $j = 0, 1, 2$) are,
respectively, 


XX XX XX 
XX XX and XX 
XX XX XX 


The network is to be used to classify patterns into one of two categories, based on 
the sign of the output unit signal. Shade a two-dimensional input space $x_1$–$x_2$
($-5 < x_1, x_2 < +5$) black or white according to the category given by the network.
Repeat with 


XX XX XX 
XX XX and XX 
XX XX XX 


XXX 
2. Create a 3-1-1 sigmoidal network with bias to be trained to classify patterns from
$\omega_1$ and $\omega_2$ in the table above. Use stochastic backpropagation (Algorithm 1) with
learning rate $\eta = 0.1$ and sigmoid as described in Eq. 33 in Sect. 6.8.2.


(a) Initialize all weights randomly in the range $-1 < w < +1$. Plot a learning curve,
i.e., the training error as a function of epoch.


(b) Now repeat (a) but with weights initialized to be the same throughout each 
level. In particular, let all input-to-hidden weights be initialized with $w_{ji} = 0.5$
and all hidden-to-output weights with $w_{kj} = -0.5$.


(c) Explain the source of the differences between your learning curves (cf. Prob- 
lem 12). 
3. Consider the nonlinearly separable categorization problem shown in Fig. 6.8.


(a) Train a 1-3-1 sigmoidal network with bias by means of batch backpropagation 
(Algorithm 2) to solve it. 


(b) Display your decision boundary by classifying points separated by $\Delta x \approx 0.1$.


(c) Repeat with a 1-2-1 network. 


(d) Inspect the decision boundary for your 1-3-1 network (or construct by hand an 
optimal one) and explain why no 1-2-1 network with sigmoidal hidden units can 
achieve it. 


4. Write a backpropagation program for a 2-2-1 network with bias to solve the XOR 
problem (see Fig. 6.1). Show the input-to-hidden weights and analyze the function of 
each hidden unit. 

5. Write a basic backpropagation program for a 3-3-1 network with bias to solve the 
three-bit parity problem, i.e., return a +1 if the number of input units that are high 
is even, and -1 if odd. 


(a) Show the input-to-hidden weights and analyze the function of each hidden unit. 


(b) Retrain several times from a new random point until you get a local (but not 
global) minimum. Analyze the function of the hidden units now. 


(c) How many patterns are properly classified for your local minimum? 


6. Write a stochastic backpropagation program for a $2 - n_H - 1$ network with bias
to classify points chosen randomly in the range $-1 < x_1, x_2 < +1$ with $P(\omega_1) =
P(\omega_2) = 0.5$. Train using 40 points (20 from each category). Train with $n_H = 1, 2, 3$
and 4 hidden units. Plot your minimum training error as a function of $n_H$. How
many hidden units are needed to implement your particular random function?

7. Train a 2-4-1 network having a different transfer function at each hidden unit on
a random problem.
8. Measure H, show that convergence is slower for $\eta_{opt} < \eta < 2\eta_{opt}$.

Section 6.5


9. Train net and show that the hidden 

Section 6.6


10. Three-layer with softmax outputs. 
11. Train with one set of priors; test with other priors. 


Section 6.7


12. Consider several gradient descent methods applied to a criterion function in one 
dimension: simple gradient descent with learning rate $\eta$, optimized descent, Newton's
method, and Quickprop. Consider first the criterion function $J(w) = w^2$ which of
course has minimum $J = 0$ at $w = 0$. In all cases, start the descent at $w(0) = 1$. For
definiteness, we consider convergence to be complete when 
(a) Plot the number of steps until convergence as a function of $\eta$ for $\eta = 0.01, 0.03, 0.1, 0.3, 1, 3$.


(b) Calculate the optimum learning rate $\eta_{opt}$ by Eq. 35, and confirm that this value
is correct from your graph in (??). 


(c) 
13. Demonstrate that preprocessing data can lead to significant reduction in time of
learning. Consider a single linear output unit for a two-category classification task,
with teaching values $t_{\omega_1} = +1$, $t_{\omega_2} = -1$, with squared error criterion.


(a) Write a program to train the three weights based on training samples. 


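(b) Generate 20 samples from each of two categories $P(\omega_1) = P(\omega_2) = 0.5$ and
$p(\mathbf{x}|\omega_i) \sim N(\boldsymbol{\mu}_i, \mathbf{I})$, where $\mathbf{I}$ is the 2 x 2 identity matrix and $\boldsymbol{\mu}_1 = (??, ??)^t$ and
$\boldsymbol{\mu}_2 = (??, ??)^t$.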

(c) Find the optimal learning rate empirically by trying a few values. 


(d) Train to minimum error. Why is there no danger of overtraining in this case? 


(e) Why can we be sure that it is at least possible that this network can achieve 
the minimum (Bayes) error? 


(f) Generate 100 test samples, 50 from each category, and find the error rate. 


(g) Now preprocess the data by subtracting off the mean and scaling standard 
deviation in each dimension. 


(h) Repeat the above, and find the optimal learning rate. 
(i) Find the error rate on the (transformed) test set. 


(j) Verify that the accuracy is virtually the same in the two cases (any differences 
can be attributed to stochastic effects). 


(k) Explain in words the underlying reasons for your results.


14. global vs. local representations 
15. standardize the input 
16. problem with hints 


Section 6.9


17. Train with Hessian near identity, train with it far from identity. 


Section 6.10


18. Compare cascade-correlation to backprop. 


Section 6.11


XXX 


Section ??
Chapter 7


Stochastic Methods 


7.1 Introduction 


Learning plays a central role in the construction of pattern classifiers. As we have
seen, the general approach is to specify a model having one or more parameters
and then estimate their values from training data. When the models are fairly simple 
and of low dimension, we can use analytic methods such as computing derivatives 
and performing gradient descent to find optimal model parameters. If the models 
are somewhat more complicated, we may calculate local derivatives and use gradient 
methods, as in neural networks and some maximum-likelihood problems. In most 
high-dimensional and complicated models, there are multiple maxima and we must 
use a variety of tricks — such as performing the search multiple times from different 
starting conditions — to have any confidence that an acceptable local maximum has 
been found. 

These methods become increasingly unsatisfactory as the models become more 
complex. A naive approach — exhaustive search through solution space — rapidly 
gets out of hand and is completely impractical for real-world problems. The more 
complicated the model, the less the prior knowledge, and the less the training data, the 
more we must rely on sophisticated search for finding acceptable model parameters. In 
this chapter we consider stochastic methods for finding parameters, where randomness 
plays a crucial role in search and learning. The general approach is to bias the search 
toward regions where we expect the solution to be and allow randomness — somehow 
— to help find good parameters, even in very complicated models. 

We shall consider two general classes of such methods. The first, exemplified by 
Boltzmann learning, is based on concepts and techniques from physics, specifically 
statistical mechanics. The second, exemplified by genetic algorithms, is based on 
concepts from biology, specifically the mathematical theory of evolution. The former 
class has a highly developed and rigorous theory and many successes in pattern recog- 
nition; hence it will command most of our effort. The latter class is more heuristic 
yet affords flexibility and can be attractive when adequate computational resources 
are available. We shall generally illustrate these techniques in cases that are simple, 
and which might also be addressed with standard gradient procedures; nevertheless 
we emphasize that these stochastic methods may be preferable in complex problems.
The methods have high computational burden, and would be of little use without 
computers. 


7.2 Stochastic search 


We begin by discussing an important and general quadratic optimization problem. 
Analytic approaches do not scale well to large problems, however, and thus we focus 
here on methods of search through different candidate solutions. We then consider a 
form of stochastic search that finds use in learning for pattern recognition. 

Suppose we have a large number of variables $s_i$, $i = 1, \ldots, N$, where each variable
can take one of two discrete values, for simplicity chosen to be $\pm 1$. The optimization
problem is this: find the values of the $s_i$ so as to minimize the cost or energy

$$E = -\frac{1}{2} \sum_{i,j=1}^{N} w_{ij} s_i s_j, \tag{1}$$

where the $w_{ij}$ can be positive or negative and are problem dependent. We require
the self-feedback terms to vanish, i.e., $w_{ii} = 0$, since non-zero $w_{ii}$ merely add an
unimportant constant to $E$, independent of the $s_i$. This optimization problem can be
visualized in terms of a network of nodes, where bi-directional links or interconnections
correspond to the weights $w_{ij} = w_{ji}$. (It is very simple to prove that we can
always replace a non-symmetric connection matrix by its symmetric part, as asked in
Problem 2. We avoid non-symmetric matrices because they unnecessarily complicate
the dynamics described in Sect. 7.2.1.) Figure 7.1 shows such a network, where nodes
are labeled input, output, and hidden, though for the moment we shall ignore such
distinctions.

This network suggests a physical analogy which in turn will guide our choice of
solution method. Imagine the network represents $N$ physical magnets, each of which
can have its north pole pointing up ($s_i = +1$) or pointing down ($s_i = -1$). The $w_{ij}$
are functions of the physical separations between the magnets. Each pair of magnets
has an associated interaction energy which depends upon their state, separation and
other physical properties: $E_{ij} = -\frac{1}{2} w_{ij} s_i s_j$. The energy of the full system is the
sum of all these interaction energies, as given in Eq. 1. The optimization task is to
find the most stable configuration of the magnets, the one with lowest energy.
This general optimization problem appears in a wide range of
applications, in many of which the weights do not have a physical interpretation.* As
mentioned, we shall be particularly interested in its application to learning methods.

Except for very small problems or few connections, it is infeasible to solve directly
for the $N$ values $s_i$ that give the minimum energy, since the space has $2^N$ possible
configurations (Problem 4). It is tempting to propose a greedy algorithm to search
for the optimum configuration: Begin by randomly assigning a state to each node.
Next consider each node in turn and calculate the energy with it in the $s_i = +1$ state
and then in the $s_i = -1$ state, and choose the one giving the lower energy. (Naturally,
this decision needs to be based on only those nodes connected to node $i$ with non-zero
weight $w_{ij}$.) Alas, this greedy search is rarely successful since the system usually
gets caught in local energy minima or never converges (Computer exercise 1).

Another method is required. 
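The greedy procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper `random_symmetric_weights`, the network size and the sweep limit are hypothetical choices, and the weights are assumed symmetric with zero diagonal so that flipping $s_i$ changes the energy of Eq. 1 by $2 s_i l_i$, where $l_i = \sum_j w_{ij} s_j$.

```python
import random

def energy(w, s):
    """Energy of Eq. 1: E = -1/2 * sum_ij w_ij s_i s_j (w symmetric, zero diagonal)."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def random_symmetric_weights(n, rng):
    """Hypothetical helper: symmetric weights drawn from [-1, 1] with w_ii = 0."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.uniform(-1.0, 1.0)
    return w

def greedy_search(w, s, max_sweeps=1000):
    """Visit each node in turn and keep whichever state gives the lower energy."""
    n = len(s)
    s = list(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Net input to node i; flipping s_i changes E by 2 * s_i * l_i.
            l_i = sum(w[i][j] * s[j] for j in range(n))
            if s[i] * l_i < 0:       # the opposite state has strictly lower energy
                s[i] = -s[i]
                changed = True
        if not changed:              # no single flip lowers the energy: a local minimum
            break
    return s

rng = random.Random(0)
w = random_symmetric_weights(8, rng)
start = [rng.choice([-1, 1]) for _ in range(8)]
final = greedy_search(w, start)
print(energy(w, start), energy(w, final))
```

The search stops as soon as no single flip lowers the energy, which is exactly the condition for a local minimum; nothing in the procedure lets it climb out again, which is the weakness that motivates the stochastic method.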


* Similar generalized energy functions, called Lyapunov functions or objective functions, can be used 
for finding optimum states in other problem domains as well.

Figure 7.1: The class of optimization problems of Eq. 1 can be viewed in terms of
a network of nodes or units, each of which can be in the $s_i = +1$ or $s_i = -1$ state.
Every pair of nodes $i$ and $j$ is connected by bi-directional weights $w_{ij}$; if a weight
between two nodes is zero then no connection is drawn. (Because the networks we
shall discuss can have an arbitrary interconnection, there is no notion of layers as in
multilayer neural networks.) The optimization problem is to find a configuration (i.e.,
assignment of all $s_i$) that minimizes the energy described by Eq. 1. The state of the
full network is indexed by an integer $\gamma$, and since here there are 17 binary nodes, $\gamma$
is bounded $0 \le \gamma < 2^{17}$. The states of the visible nodes and hidden nodes are indexed
by $\alpha$ and $\beta$, respectively, and in this case are bounded $0 \le \alpha < 2^{10}$ and $0 \le \beta < 2^{7}$.


7.2.1 Simulated annealing 


In physics, the method for allowing a system such as many magnets or atoms in an al- 
loy to find a low-energy configuration is based on annealing. In physical annealing the 
system is heated, thereby conferring randomness to each component (magnet). As a 
result, each variable can temporarily assume a value that is energetically unfavorable 
and the full system explores configurations that have high energy. Annealing proceeds 
by gradually lowering the temperature of the system — ultimately toward zero and 
thus no randomness — so as to allow the system to relax into a low-energy config- 
uration. Such annealing is effective because even at moderately high temperatures, 
the system slightly favors regions in the configuration space that are overall lower in 
energy, and hence are more likely to contain the global minimum. As the temperature 
is lowered, the system has increased probability of finding the optimum configuration. 
This method is successful in a wide range of energy functions or energy “landscapes,” 
though there are pathological cases such as the “golf course” landscape in Fig. 7.2 
where it is unlikely to succeed. Fortunately, the problems in learning we shall consider 
rarely involve such pathological functions. 


7.2.2 The Boltzmann factor 


Figure 7.2: The energy function or energy “landscape” on the left is meant to suggest
the types of optimization problems addressed by simulated annealing. The method
uses randomness, governed by a control parameter or “temperature” $T$, to avoid getting
stuck in local energy minima and thus to find the global minimum, like a small ball
rolling in the landscape as it is shaken. The pathological “golf course” landscape
at the right is, generally speaking, not amenable to solution via simulated annealing
because the region of lowest energy is so small and is surrounded by energetically
unfavorable configurations. The configuration space of the problems we shall address
is discrete, and thus the continuous $x_1$-$x_2$ space shown here is a bit misleading.

The statistical properties of a large number of interacting physical components at a
temperature $T$, such as molecules in a gas or magnetic atoms in a solid, have been
thoroughly analyzed. A key result, which relies on a few very natural assumptions, is
the following: the probability the system is in a (discrete) configuration indexed by $\gamma$
having energy $E_\gamma$ is given by

$$P(\gamma) = \frac{e^{-E_\gamma/T}}{Z(T)}, \tag{2}$$

where $Z$ is a normalization constant. The numerator is the Boltzmann factor and the
denominator the partition function, the sum over all possible configurations

$$Z(T) = \sum_{\gamma'} e^{-E_{\gamma'}/T}, \tag{3}$$

which guarantees Eq. 2 represents a true probability.* The number of configurations
is very high, $2^N$, and in physical systems $Z$ can be calculated only in simple cases.
Fortunately, we need not calculate the partition function, as we shall see.

* In the Boltzmann factor for physical systems there is a “Boltzmann constant” which converts a
temperature into an energy; we can ignore this factor by scaling the temperature in our simulations.

Because of the fundamental importance of the Boltzmann factor in our discussions,
it pays to take a slight detour to understand it, at least in an informal way.
Consider a different, but nonetheless related system: one consisting of a large number
of non-interacting magnets, that is, without interconnecting weights, in a uniform
external magnetic field. If a magnet is pointing up, $s_i = +1$ (in the same direction
as the field), it contributes a small positive energy to the total system; if the magnet
is pointing down, a small negative energy. The total energy of the collection is thus
proportional to the total number of magnets pointing up. The probability the system
has a particular total energy is related to the number of configurations that have
that energy. Consider the highest energy configuration, with all magnets pointing
up. There is only $\binom{N}{0} = 1$ configuration that has this energy. The next to highest
energy comes with just a single magnet pointing down; there are $\binom{N}{1} = N$ such
configurations. The next lower energy configurations have two magnets pointing down;
there are $\binom{N}{2} = N(N-1)/2$ of these configurations, and so on. The number of states
declines exponentially with increasing energy. Because of the statistical independence
of the magnets, for large $N$ the probability of finding the state in energy $E$ also
decays exponentially (Problem 7). In sum, then, the exponential form of the Boltzmann
factor in Eq. 2 is due to the exponential decrease in the number of accessible
configurations with increasing energy. Further, at high temperature there is, roughly
speaking, more energy available and thus an increased probability of higher-energy
states. This describes qualitatively the dependence of the probability upon $T$ in the
Boltzmann factor: at high $T$, the probability is distributed roughly evenly among all
configurations, while at low $T$ it is concentrated at the lowest-energy configurations.

If we move from the collection of independent magnets to the case of magnets 
interconnected by weights, the situation is a bit more complicated. Now the energy 
associated with a magnet pointing up or down depends upon the state of others. 
Nonetheless, in the case of large N, the number of configurations decays exponentially 
with the energy of the configuration, as described by the Boltzmann factor of Eq. 2. 
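For a network small enough to enumerate, the Boltzmann probabilities of Eq. 2 and the partition function of Eq. 3 can be computed directly, which makes the temperature dependence concrete. The sketch below is illustrative only: the three-node weight matrix is an arbitrary assumption, and exhaustive enumeration is used purely because $2^3 = 8$ configurations are trivial to list (it is precisely what cannot be done for large $N$).

```python
import itertools
import math

def energy(w, s):
    """Energy of Eq. 1 for symmetric weights w with zero diagonal."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def boltzmann_distribution(w, T):
    """Return (configs, probs) with probs[g] = exp(-E_g / T) / Z(T), as in Eqs. 2 and 3."""
    configs = list(itertools.product([-1, 1], repeat=len(w)))
    energies = [energy(w, s) for s in configs]
    e_min = min(energies)   # shift for numerical stability; it cancels in the ratio
    factors = [math.exp(-(e - e_min) / T) for e in energies]
    z = sum(factors)        # partition function (up to the common shift factor)
    return configs, [f / z for f in factors]

# Arbitrary illustrative weights: symmetric, zero diagonal.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]

for T in (100.0, 1.0, 0.05):
    configs, probs = boltzmann_distribution(w, T)
    print(T, [round(p, 4) for p in probs])
```

At the highest temperature the eight probabilities are nearly equal; at the lowest, essentially all the probability sits on the two lowest-energy configurations, which are mirror images of each other because Eq. 1 is unchanged when every $s_i$ is negated.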


Simulated annealing algorithm 


The above discussion and the physical analogy suggest the following simulated annealing
method for finding the optimum configuration to our general optimization
problem. Start with randomized states throughout the network, $s_i(1)$, and select a
high initial “temperature” $T(1)$. (Of course in the simulation $T$ is merely a control
parameter which will control the randomness; it is not a true physical temperature.)
Next, choose a node $i$ randomly. Suppose its state is $s_i = +1$. Calculate the system
energy in this configuration, $E_a$; next recalculate the energy, $E_b$, for a candidate new
state $s_i = -1$. If this candidate state has a lower energy, accept this change in
state. If however the energy is higher, accept this change with a probability equal to

$$e^{-\Delta E_{ab}/T}, \tag{4}$$

where $\Delta E_{ab} = E_b - E_a$. This occasional acceptance of a state that is energetically less
favorable is crucial to the success of simulated annealing, and is in marked distinction
to naive gradient descent and the greedy approach mentioned above. The key
benefit is that it allows the system to jump out of unacceptable local energy minima.
For example, at very high temperatures, every configuration has a Boltzmann factor
$e^{-E_\gamma/T} \approx e^{0}$, roughly equal. After normalization by the partition function, then, every
configuration is roughly equally likely. This implies every node is equally likely to be
in either of its two states (Problem 6).

The algorithm continues polling (selecting and testing) the nodes randomly several 
times and setting their states in this way. Next lower the temperature and repeat the 
polling. Now, according to Eq. 4, there will be a slightly smaller probability that a 
candidate higher energy state will be accepted. Next the algorithm polls all the nodes 
until each node has been visited several times. Then the temperature is lowered 
further, the polling repeated, and so forth. At very low temperatures, the probability 
that an energetically less favorable state will be accepted is small, and thus the search 
becomes more like a greedy algorithm. Simulated annealing terminates when the 
temperature is very low (near zero). If this cooling has been sufficiently slow, the 
system then has a high probability of being in a low energy state — hopefully the 
global energy minimum.

Because it is the difference in energies between the two states that determines
the acceptance probabilities, we need only consider nodes connected to the one being
polled; all the units not connected to the polled unit are in the same state and
contribute the same total amount to the full energy. We let $N_i$ denote the set of
nodes connected with non-zero weights to node $i$; in a fully connected net $N_i$ would
include the complete set of $N - 1$ remaining nodes. Further, we let Rand[0, 1) denote
a randomly selected non-negative real number less than 1. With this notation, then, the
randomized or stochastic simulated annealing algorithm is:


Algorithm 1 (Stochastic simulated annealing)


 1 begin initialize T(k), k_max, s_i(1), w_ij for i, j = 1, ..., N
 2    k ← 0
 3    do k ← k + 1
 4       do select node i randomly; suppose its state is s_i
 5          E_a ← −1/2 Σ_{j ∈ N_i} w_ij s_i s_j
 6          E_b ← −E_a
 7          if E_b < E_a
 8             then s_i ← −s_i
 9             else if e^{−(E_b − E_a)/T(k)} > Rand[0, 1)
10                then s_i ← −s_i
11       until all nodes polled several times
12    until k = k_max or stopping criterion met
13    return E, s_i, for i = 1, ..., N
14 end


Because units are polled one at a time, the algorithm is occasionally called sequential
simulated annealing. Note that in line 5, we define $E_a$ based only on those units
connected to the polled one, a slightly different convention than in Eq. 1. Changing
the usage in this way has no effect, since in line 9 it is the difference in energies that
determines transition probabilities.
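A minimal Python sketch of Algorithm 1 follows. Several details are assumptions made for illustration rather than part of the algorithm as stated: the geometric schedule with factor `c`, the parameter `polls_per_node` standing in for "polled several times", and the default values. The sketch also uses the exact energy change under Eq. 1 for flipping $s_i$, $\Delta E = 2 s_i l_i$ with $l_i = \sum_j w_{ij} s_j$, rather than the scaled $E_a$ of line 5; the two differ by a constant factor that can be absorbed into the temperature.

```python
import math
import random

def energy(w, s):
    """Energy of Eq. 1 for symmetric weights w with zero diagonal."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def simulated_annealing(w, T1=10.0, c=0.9, k_max=100, polls_per_node=5, seed=0):
    """Anneal binary states s_i in {-1, +1}; returns (states, energy)."""
    rng = random.Random(seed)
    n = len(w)
    s = [rng.choice([-1, 1]) for _ in range(n)]      # random initial configuration
    T = T1
    for _ in range(k_max):
        for _ in range(polls_per_node * n):          # poll nodes at this temperature
            i = rng.randrange(n)
            l_i = sum(w[i][j] * s[j] for j in range(n))
            delta_e = 2.0 * s[i] * l_i               # E_b - E_a for the candidate flip
            # Accept downhill moves always, uphill moves with probability exp(-dE/T).
            if delta_e < 0 or math.exp(-delta_e / T) > rng.random():
                s[i] = -s[i]
        T *= c                                       # lower the temperature
    return s, energy(w, s)

# Illustrative problem: six nodes, every pair coupled with weight +1.
n = 6
w = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
states, e = simulated_annealing(w)
print(states, e)
```

In the illustrative problem every pair of nodes prefers to agree, so the two global minima are the all-up and all-down configurations; a useful sanity check on any implementation is that it recovers one of them.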

There are several aspects of the algorithm that must be considered carefully, in
particular the starting temperature, ending temperature, the rate at which the temperature
is decreased and the stopping criterion. Together these define the cooling
schedule or, more frequently, the annealing schedule, $T(k)$, where $k$ is an iteration index.
We demand $T(1)$ to be sufficiently high that all configurations have roughly equal
probability. This demands the temperature be larger than the maximum difference
in energy between any configurations. Such a high temperature allows the system to
move to any configuration which may be needed, since the random initial configuration
may be far from the optimal. The decrease in temperature must be both gradual
and slow enough that the system can move to any part of the state space before being
trapped in an unacceptable local minimum, points we shall consider below. At the
very least, annealing must allow $N/2$ transitions, since a global optimum never differs
from any configuration by more than this number of steps. (In practice, annealing
can require polling several orders of magnitude more times than this number.) The
final temperature must be low enough (or equivalently $k_{max}$ must be large enough or
a stopping criterion must be good enough) that there is a negligible probability that
if the system is in a global minimum it will move out.

Figure 7.3 shows that early in the annealing process when the temperature is
high, the system explores a wide range of configurations. Later, as the temperature
is lowered, only states “close” to the global minimum are tested. Throughout the
process, each transition corresponds to the change in state of a single unit.

A typical choice of annealing schedule is T(k + 1) = cT(k) with 0 < c < 1. If 
computational resources are of no concern, a high initial temperature, large c < 1, 
and large kmax are most desirable. Values in the range 0.8 < c < 0.99 have been 
found to work well in many real-world problems. In practice the algorithm is slow, 
requiring many iterations and many passes through all the nodes, though for all but 
the smallest problems it is still faster than exhaustive search (Problem 5). We shall 
revisit the issue of parameter setting in the context of learning in Sect. 7.3.4. 
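The geometric schedule $T(k+1) = cT(k)$ is simple enough to state in code, and doing so makes the trade-off between $c$ and $k_{max}$ explicit. The following is a small sketch; the function names and the example values are hypothetical, and the closed form simply inverts $T(k) = T(1)\,c^{k-1}$.

```python
import math

def geometric_schedule(T1, c, k_max):
    """Yield T(1), T(2), ..., T(k_max) with T(k + 1) = c * T(k), 0 < c < 1."""
    if not 0.0 < c < 1.0:
        raise ValueError("c must satisfy 0 < c < 1")
    T = T1
    for _ in range(k_max):
        yield T
        T *= c

def iterations_to_reach(T1, T_final, c):
    """Smallest k with T(k) <= T_final, using T(k) = T1 * c**(k - 1)."""
    return max(1, math.ceil(math.log(T_final / T1) / math.log(c)) + 1)

print(list(geometric_schedule(10.0, 0.5, 4)))
print(iterations_to_reach(10.0, 0.01, 0.9))
print(iterations_to_reach(10.0, 0.01, 0.99))
```

Because the number of iterations grows like $1/|\log c|$, moving $c$ from 0.9 toward 0.99 multiplies the length of the anneal roughly tenfold for the same starting and final temperatures.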

While Fig. 7.3 displayed a single trajectory through the configuration space, a more 
relevant property is the probability of being in a configuration as the system is annealed 
gradually. Figure 7.4 shows such probability distributions at four temperatures. Note 
especially that at the final, low temperature the probability is concentrated at the 
global minima, as desired. While this figure shows that for positive temperature all 
states have a non-zero probability of being visited, we must recognize that only a 
small fraction of configurations are in fact visited in any anneal. In short, in the 
vast majority of large problems, annealing does not require that all configurations be 
explored, and hence it is more efficient than exhaustive search. 


7.2.3 Deterministic simulated annealing 


Stochastic simulated annealing is slow, in part because of the discrete nature of the
search through the space of all configurations, i.e., an $N$-dimensional hypercube. Each
trajectory is along a single edge, thereby missing full gradient information that would
be provided by analog state values in the “interior” of the hypercube. An alternate,
faster method is to allow each node to take on analog values during search; at the
end of the search the values are forced to be $s_i = \pm 1$, as required by the optimization
problem. Such a deterministic simulated annealing algorithm also follows from the
physical analogy. Consider a single node (magnet) $i$ connected to several others; each
exerts a force tending to point node $i$ up or down. In deterministic annealing we sum
the forces and give a continuous value for $s_i$. If there is a large “positive” force, then
$s_i \approx +1$; if a large negative force, then $s_i \approx -1$. In the general case $s_i$ will lie between
these limits.

The value of $s_i$ also depends upon the temperature. At high $T$ (large randomness)
even a large upward force will not be enough to ensure $s_i = +1$, whereas at low
temperature it will. We let $l_i = \sum_j w_{ij} s_j$ be the force exerted on node $i$; the updated
value is

$$s_i = f(l_i, T) = \tanh[l_i/T], \tag{5}$$

where there is an implied scaling of the force and temperature in the response function
$f(\cdot, \cdot)$ (Fig. 7.5). In broad overview, deterministic annealing consists in setting an
annealing schedule and then at each temperature finding an equilibrium analog value
for every $s_i$. This analog value is merely the expected value of the discrete $s_i$ in a
system at temperature $T$ (Problem 8). At low temperatures (i.e., at the end of the
anneal), each variable will assume an extreme value $\pm 1$, as can be seen in the low-$T$
curve in Fig. 7.5.
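The claim that the analog value is the expected value of the binary variable can be checked in a few lines. The sketch below is an outline rather than the full argument of Problem 8, and it rests on stated assumptions: the weights are symmetric with $w_{ii} = 0$, all other nodes are held fixed, and the energy is that of Eq. 1, so the terms involving $s_i$ sum to $-s_i l_i$.

```latex
\begin{aligned}
E(s_i) &= -s_i\, l_i + \text{const}, \qquad l_i = \sum_j w_{ij} s_j, \\
P(s_i = \pm 1) &= \frac{e^{\pm l_i/T}}{e^{l_i/T} + e^{-l_i/T}}, \\
\mathcal{E}[s_i] &= (+1)\,P(s_i = +1) + (-1)\,P(s_i = -1)
  = \frac{e^{l_i/T} - e^{-l_i/T}}{e^{l_i/T} + e^{-l_i/T}}
  = \tanh[l_i/T].
\end{aligned}
```

The second line is the Boltzmann distribution of Eq. 2 restricted to the two states of node $i$; the constant cancels between numerator and denominator.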

It is instructive to consider the energy landscape for the continuous case. Differentiation
of Eq. 1 shows that the energy is linear in each variable when the others are held
fixed, as can be seen in Fig. 7.6; there are no local minima along any “cut” parallel
to any axis. Note too that there are no stable local energy minima within the volume
of the space; the energy minima always occur at the “corners,” i.e., extreme $s_i = \pm 1$
for all $i$, as required by the optimization problem.

Figure 7.3: Stochastic simulated annealing (Algorithm 1) uses randomness, governed
by a control parameter or “temperature” $T(k)$, to search through a discrete space
for a minimum of an energy function. In this example there are $N = 6$ variables;
the $2^6 = 64$ configurations are shown at the bottom as columns of + and -.
Also plotted is the energy of each configuration, given by Eq. 1 for randomly
chosen weights. Every transition corresponds to the change of just a single $s_i$. (The
configurations have been arranged so that adjacent ones differ by the state of just
a single node; nevertheless most transitions corresponding to a single node appear
far apart in this ordering.) Because the system energy is invariant with respect
to a global interchange $s_i \leftrightarrow -s_i$, there are two “global” minima. The graph at
the upper left shows the annealing schedule, the decreasing temperature versus
iteration number $k$. The middle portion shows the configuration versus iteration
number generated by Algorithm 1. The trajectory through the configuration space
is colored red for transitions that increase the energy; late in the annealing such
energetically unfavorable (red) transitions are rarer. The graph at the right shows
the full energy $E(k)$, which decreases to the global minimum.

Figure 7.4: An estimate of the probability $P(\gamma)$ of being in a configuration denoted
by $\gamma$ is shown for four temperatures during a slow anneal. (These estimates, based on
a large number of runs, are nearly the theoretical values $e^{-E_\gamma/T}$.) Early, at high $T$,
each configuration is roughly equal in probability while late, at low $T$, the probability
is strongly concentrated at the global minima. The expected value of the energy, $\mathcal{E}[E]$
(i.e., averaged at temperature $T$), decreases gradually during the anneal.

This search method is sometimes called mean-field annealing because each node
responds to the average or mean of the forces (fields) due to the nodes connected to
it. In essence the method approximates the effects of all other magnets while ignoring
their mutual interactions and their response to the magnet in question, node $i$. Such
annealing is also called deterministic because in principle we could deterministically
solve the simultaneous equations governing the $s_i$ as the temperature is lowered. The
algorithm has a natural parallel mode of implementation, for instance where each value
$s_i$ is updated simultaneously and deterministically as the temperature is lowered. In
an inherently serial simulation, however, the nodes are updated one at a time. Even
though the nodes might be polled pseudo-randomly, the algorithm is in principle
deterministic; there need be no inherent randomness in the search. If we let $s_i(1)$
denote the initial state of unit $i$, the algorithm is:


Algorithm 2 (Deterministic simulated annealing)


1 begin initialize T(k), w_ij, s_i(1), i, j = 1, ..., N
2    k ← 0
3    do k ← k + 1
4       select node i randomly
5       l_i ← Σ_{j ∈ N_i} w_ij s_j
6       s_i ← f(l_i, T(k))
7    until k = k_max or convergence criterion met
8    return E, s_i, i = 1, ..., N
9 end


Figure 7.5: In deterministic annealing, each node can take on a continuous value
$-1 \le s_i \le +1$, which equals the expected value of a binary node in the system at
temperature $T$. In other words, the analog value $s_i$ replaces the expectation of the
discrete variable, $\mathcal{E}[s_i]$. We let $l_i$ denote a force exerted by the nodes connected to $s_i$.
The larger this force, the closer the analog $s_i$ is to +1; the more negative this force,
the closer to -1. The temperature $T$ (marked in red) also affects $s_i$. If $T$ is large,
there is a great deal of randomness and even a large force will not ensure $s_i = +1$.
At low temperature, there is little or no randomness and even a small positive force
ensures that $s_i = +1$. Thus at the end of an anneal, each node has value $s_i = +1$ or
$s_i = -1$.


In practice, deterministic and stochastic annealing give very similar solutions. In 
large real-world problems deterministic annealing is faster, sometimes by two or three 
orders of magnitude. 
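Deterministic (mean-field) annealing differs from the stochastic version only in the update: each polled node is set to the analog value of Eq. 5 instead of being flipped at random. The sketch below is one possible reading under stated assumptions: several updates are made at each temperature before it is lowered (the parameter `sweeps` is hypothetical), the schedule is geometric, the small random analog starting values are an arbitrary choice, and the final analog values are forced to $\pm 1$ by taking their sign.

```python
import math
import random

def energy(w, s):
    """Energy of Eq. 1 for symmetric weights w with zero diagonal."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def mean_field_annealing(w, T1=10.0, c=0.9, k_max=80, sweeps=5, seed=0):
    """Anneal analog states with s_i = tanh(l_i / T); returns binary states."""
    rng = random.Random(seed)        # used only to choose the polling order
    n = len(w)
    s = [rng.uniform(-0.1, 0.1) for _ in range(n)]   # small analog starting values
    T = T1
    for _ in range(k_max):
        for _ in range(sweeps * n):
            i = rng.randrange(n)
            l_i = sum(w[i][j] * s[j] for j in range(n))   # force on node i
            s[i] = math.tanh(l_i / T)                     # response function, Eq. 5
        T *= c
    return [1 if x >= 0 else -1 for x in s]              # force values to +/-1

# Illustrative problem: six nodes, every pair coupled with weight +1.
n = 6
w = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
s = mean_field_annealing(w)
print(s, energy(w, s))
```

Note that randomness enters only through the starting values and the order in which nodes are polled; given those, every update is a deterministic function of the current analog state.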

Simulated annealing can also be applied to other classes of optimization problem,
for instance, finding the minimum in $\sum_{ijk} w_{ijk} s_i s_j s_k$. We will not consider such
higher-order problems, though they can be the basis of learning methods as well.


7.3 Boltzmann learning 


For pattern recognition, we will use a network such as that in Fig. 7.1, where the input 
units accept binary feature information and the output units represent the categories, 
generally in the familiar 1-of-c representation (Fig. 7.7). During classification the 
input units are held fixed or clamped to the feature values of the input pattern; the 
remaining units are annealed to find the lowest energy, most probable configuration. 
The category information is then read from the final values of the output units. 
Of course, accurate recognition requires proper weights, and thus we now turn to a
method for learning weights from training patterns. There are two closely related
approaches to such learning, one based on stochastic and the other on deterministic
simulated annealing.

Figure 7.6: If the state variables $s_i$ can assume analog values (as in mean-field annealing),
the energy in Eq. 1 is a general quadratic form having minima at the extreme values
$s_i = \pm 1$. In this case $N = 3$ nodes are fully interconnected with arbitrary weights
$w_{ij}$. While the total energy function is three-dimensional, we show two-dimensional
surfaces for each of three values of $s_3$. The energy is linear in each variable so long as
the other variables are held fixed. Further, the energy is invariant with respect to the
interchange of all variables $s_i \to -s_i$. In particular, here the global minimum occurs
at $s_1 = -1$, $s_2 = +1$ and $s_3 = -1$ as well as the symmetric configuration $s_1 = +1$,
$s_2 = -1$ and $s_3 = +1$.


7.3.1 Stochastic Boltzmann learning of visible states 


Before we turn to our central concern, learning categories from training patterns,
consider an alternate learning problem where we have a set of desired probabilities
for all the visible units, $Q(\alpha)$ (given by a training set), and seek weights so that the
actual probability $P(\alpha)$, achieved in random simulations, matches these probabilities
over all patterns as closely as possible. In this alternative learning problem the desired
probabilities would be derived from training patterns containing both input (feature)
and output (category) information. The actual probability describes the states of a
network annealed with neither input nor output variables clamped.

We now make use of the distinction between configurations of “visible” units (the
input and output, denoted $\alpha$), and the hidden states, denoted $\beta$, shown in Fig. 7.1.
For instance, whereas $a$ and $b$ (cf. Eq. 4) referred to different configurations of the
full system, $\alpha$ and $\beta$ will specify visible and hidden configurations.

Figure 7.7: When a network such as shown in Fig. 7.1 is used for learning, it is
important to distinguish between two types of visible units (the $d$ input units and
$c$ output units, which receive external feature and category information) as well as
the remaining, hidden units. The state of the full network is indexed by an integer
$\gamma$, and since here there are 17 binary nodes, $\gamma$ is bounded $0 \le \gamma < 2^{17}$. The state
of the visible nodes is described by $\alpha$; moreover, $\alpha^i$ describes the input and $\alpha^o$ the
output (the superscripts are not indexes, but merely refer to the input and output,
respectively). The state of the hidden nodes is indexed by $\beta$.

The probability of a visible configuration is the sum over all possible hidden
configurations:

$$P(\alpha) = \sum_{\beta} P(\alpha, \beta) = \frac{\sum_{\beta} e^{-E_{\alpha\beta}/T}}{Z}, \tag{6}$$

where $E_{\alpha\beta}$ is the system energy in the configuration defined by the visible and hidden
parts, and $Z$ is again the full partition function. Equation 6 is based on Eq. 3 and
states simply that to find the probability of a given visible state $\alpha$, we sum over all
possible hidden states. A natural measure of the difference between the actual and
the desired probability distributions is the relative entropy, Kullback-Leibler distance
or Kullback-Leibler divergence,

$$D_{KL}(Q(\alpha), P(\alpha)) = \sum_{\alpha} Q(\alpha) \log \frac{Q(\alpha)}{P(\alpha)}. \tag{7}$$

Naturally, $D_{KL}$ is non-negative and can be zero if and only if $P(\alpha) = Q(\alpha)$ for all $\alpha$
(Appendix ??). Note that Eq. 6 depends solely upon the visible units, not the hidden
units.
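The relative entropy of Eq. 7 is straightforward to compute when the two distributions are given as lists over the same set of visible configurations. The following sketch uses made-up distributions for illustration; the convention that terms with $Q(\alpha) = 0$ contribute nothing, and that the result is infinite when $Q$ puts mass where $P$ has none, is the standard one but is an assumption not spelled out in the text.

```python
import math

def kl_divergence(q, p):
    """D_KL(Q, P) = sum_a Q(a) * log(Q(a) / P(a)), Eq. 7."""
    total = 0.0
    for q_a, p_a in zip(q, p):
        if q_a > 0.0:                 # terms with Q(a) = 0 contribute nothing
            if p_a <= 0.0:
                return math.inf       # Q puts mass where P has none
            total += q_a * math.log(q_a / p_a)
    return total

# Made-up desired (q) and actual (p) distributions over four visible configurations.
q = [0.5, 0.25, 0.25, 0.0]
p = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(q, p))
print(kl_divergence(q, q))
print(kl_divergence(p, q))
```

The measure is not symmetric in its arguments, which is why the order (desired distribution first, actual distribution second) matters in what follows. For a Boltzmann network at positive temperature every configuration has non-zero probability, so the infinite case does not arise for $P$.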

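Learning is based on gradient descent in the relative entropy. A set of training
patterns defines $Q(\alpha)$, and we seek weights so that at some temperature $T$ the actual
distribution $P(\alpha)$ matches $Q(\alpha)$ as closely as possible. Thus we take an untrained
network and update each weight according to:

$$\Delta w_{ij} = -\eta \frac{\partial D_{KL}}{\partial w_{ij}} = \eta \sum_{\alpha} \frac{Q(\alpha)}{P(\alpha)} \frac{\partial P(\alpha)}{\partial w_{ij}}, \tag{8}$$

where $\eta$ is a learning rate. While $P$ depends on the weights, $Q$ does not, and thus we
used $\partial Q(\alpha)/\partial w_{ij} = 0$. We take the derivative in Eq. 6 and find:

$$\frac{\partial P(\alpha)}{\partial w_{ij}} = \frac{1}{T} \left[ \sum_{\beta} s_i(\alpha\beta) s_j(\alpha\beta) P(\alpha, \beta) - P(\alpha)\, \mathcal{E}[s_i s_j] \right]. \tag{9}$$

Here $s_i(\alpha\beta)$ is the state of node $i$ in the full configuration specified by $\alpha$ and $\beta$. Of
course, if node $i$ is a visible one, then only the value of $\alpha$ is relevant; if the node is
a hidden one, then only the value of $\beta$ is relevant. (Our notation unifies these two
cases.) The expectation value $\mathcal{E}[s_i s_j]$ is taken at temperature $T$. We gather terms
and find from Eqs. 8 & 9

$$\begin{aligned}
\Delta w_{ij} &= \frac{\eta}{T} \left[ \sum_{\alpha} \frac{Q(\alpha)}{P(\alpha)} \sum_{\beta} s_i(\alpha\beta) s_j(\alpha\beta) P(\alpha, \beta) - \sum_{\alpha} Q(\alpha)\, \mathcal{E}[s_i s_j] \right] \\
&= \frac{\eta}{T} \left[ \sum_{\alpha\beta} Q(\alpha) P(\beta|\alpha) s_i(\alpha\beta) s_j(\alpha\beta) - \mathcal{E}[s_i s_j] \right] \\
&= \frac{\eta}{T} \Big[ \underbrace{\mathcal{E}_Q[s_i s_j]_{\alpha\ \text{clamped}}}_{\text{learning}} - \underbrace{\mathcal{E}[s_i s_j]_{\text{free}}}_{\text{unlearning}} \Big],
\end{aligned} \tag{10}$$

where $P(\alpha, \beta) = P(\beta|\alpha) P(\alpha)$. We have defined

$$\mathcal{E}_Q[s_i s_j]_{\alpha\ \text{clamped}} = \sum_{\alpha\beta} Q(\alpha) P(\beta|\alpha) s_i(\alpha\beta) s_j(\alpha\beta) \tag{11}$$

to be the correlation of the variables $s_i$ and $s_j$ when the visible units are held fixed
(clamped) in visible configuration $\alpha$, averaged according to the probabilities of
the training patterns, $Q(\alpha)$.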

The first term on the right of Eq. 10 is informally referred to as the learning
component or teacher component (as the visible units are held to values given by
the teacher), and the second term the unlearning or student component (where the
variables are free to vary). If $\mathcal{E}_Q[s_i s_j]_{\alpha\ \text{clamped}} = \mathcal{E}[s_i s_j]_{\text{free}}$, then $\Delta w_{ij} = 0$ and we
have achieved the desired weights. The unlearning component reduces spurious
correlations between units, spurious in that they are not due to the training patterns.
A learning algorithm based on the above derivation would present each pattern in
the full training set several times and adjust the weights by Eq. 10, just as we saw in
numerous other training methods such as backpropagation (Chap. ??).
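For a network with only a handful of units, the clamped and free correlations in Eq. 10 can be computed exactly by enumerating every configuration, which gives a compact way to see the learning rule at work. The sketch below is an illustration under explicit assumptions and is not the stochastic procedure itself: exact enumeration replaces annealing-based estimates of the correlations, the network (two visible units, one hidden unit), the target distribution `q`, the learning rate and the number of epochs are all made up, and both $w_{ij}$ and $w_{ji}$ receive the same update so that the weights stay symmetric.

```python
import itertools
import math

def energy(w, s):
    """Energy of Eq. 1 for symmetric weights w with zero diagonal."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def joint(w, T):
    """Exact P(alpha, beta) for every full configuration (Eqs. 2 and 3)."""
    configs = list(itertools.product([-1, 1], repeat=len(w)))
    factors = [math.exp(-energy(w, s) / T) for s in configs]
    z = sum(factors)
    return {s: f / z for s, f in zip(configs, factors)}

def visible_marginal(p_joint, n_vis):
    """P(alpha) of Eq. 6: sum the joint over the hidden configurations beta."""
    p_vis = {}
    for s, p in p_joint.items():
        p_vis[s[:n_vis]] = p_vis.get(s[:n_vis], 0.0) + p
    return p_vis

def kl(q, p_vis):
    """Relative entropy of Eq. 7 between desired Q and actual P."""
    return sum(qa * math.log(qa / p_vis[a]) for a, qa in q.items() if qa > 0)

def weight_update(w, q, n_vis, T, eta):
    """One step of Eq. 10: (eta / T) * (clamped correlation - free correlation)."""
    n = len(w)
    p_joint = joint(w, T)
    p_vis = visible_marginal(p_joint, n_vis)
    delta = [[0.0] * n for _ in range(n)]
    for s, p in p_joint.items():
        alpha = s[:n_vis]
        clamped_p = q.get(alpha, 0.0) * p / p_vis[alpha]   # Q(alpha) P(beta | alpha)
        for i in range(n):
            for j in range(n):
                if i != j:                                  # keep w_ii = 0
                    delta[i][j] += (eta / T) * (clamped_p - p) * s[i] * s[j]
    return [[w[i][j] + delta[i][j] for j in range(n)] for i in range(n)]

def train(w, q, n_vis, T=1.0, eta=0.1, epochs=500):
    for _ in range(epochs):
        w = weight_update(w, q, n_vis, T, eta)
    return w

# Two visible units and one hidden unit; the target says the visible units agree.
q = {(1, 1): 0.5, (-1, -1): 0.5}
w0 = [[0.0] * 3 for _ in range(3)]
w_trained = train(w0, q, n_vis=2)
print(kl(q, visible_marginal(joint(w0, 1.0), 2)))
print(kl(q, visible_marginal(joint(w_trained, 1.0), 2)))
```

Printing the relative entropy before and after training shows the quantity that the update is designed to reduce. In this toy problem the weight between the two visible units grows positive, since the training distribution makes them perfectly correlated while the free-running network initially leaves them uncorrelated.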


Stochastic Learning of input-output associations 


Now consider the problem of learning mappings from input to output — our real in- 
terest in pattern recognition. Here we want the network to learn associations between 
the (visible) states on the input units, denoted a’, and states on the output units, 
denoted a”, as shown in Fig. 7.1. Formally, we want P(a°|a') to match Q(a°|a’) 
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as closely as possible. The appropriate cost function here is the Kullback-Leibler 
divergence weighted by the probability of each input pattern: 


Dxi(Q(a*la?), P(a®|a’)) = 2 Pla) y cla KA, (12) 


Just as in Eq. 8, learning involves changing weights to reduce this weighted distance, 
i.e., 


$$\Delta w_{ij} = -\eta\,\frac{\partial D_{KL}}{\partial w_{ij}}. \tag{13}$$


The derivation of the full learning rule follows closely that leading to Eq. 11; the only 
difference is that the input units are clamped in both the learning and unlearning 
components (Problem 11). The result is that the weight update is 


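$$\Delta w_{ij} = \frac{\eta}{T}\Big[\underbrace{\mathcal{E}_Q[s_i s_j]_{\alpha^i\alpha^o\ \mathrm{clamped}}}_{\text{learning}} - \underbrace{\mathcal{E}[s_i s_j]_{\alpha^i\ \mathrm{clamped}}}_{\text{unlearning}}\Big]. \tag{14}$$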


In Sect. 7.3.3 we shall present pseudocode for implementing the preferred, deter- 
ministic version of Boltzmann learning, but first we can gain intuition into the general 
method by considering the learning of a single pattern according to Eq. 14. Figure 7.8 
shows a seven-unit network being trained with the input pattern $s_1 = +1$, $s_2 = +1$
and the output pattern $s_6 = -1$, $s_7 = +1$. In a typical 1-of-c representation, this
desired output signal would represent category $\omega_2$. Since during both training and
classification, the input units $s_1$ and $s_2$ are clamped at the value +1, we have shown
only the associated $2^5 = 32$ configurations at the right. The energy before learning
(Eq. 1), corresponding to randomly chosen weights, is shown in black. After the
weights are trained by Eq. 14 using the pattern shown, the energy is changed (shown 
in red). Note particularly that all states having the desired output pattern have their 
energies lowered through training, just as we need. Thus when these input states are 
clamped and the remaining network annealed, the desired output is more likely to
be found. 

Equation 14 appears a bit different from those we have encountered in pattern 
recognition, and it is worthwhile explaining it carefully. Figure 7.9 illustrates in 
greater detail the learning of the single training pattern in Fig. 7.8. Because $s_1$ and
$s_2$ are clamped throughout, $\mathcal{E}_Q[s_1 s_2]_{\alpha^i\alpha^o\ \mathrm{clamped}} = 1 = \mathcal{E}[s_1 s_2]_{\alpha^i\ \mathrm{clamped}}$, and thus the
weight $w_{12}$ is not changed, as indeed given by Eq. 14. Consider a more general case,
involving $s_1$ and $s_7$. During the learning phase both units are clamped at +1 and
thus the correlation is $\mathcal{E}_Q[s_1 s_7] = +1$. During the unlearning phase, the output $s_7$ is
free to vary and the correlation is lower; in fact it happens to be negative. Thus, the
learning rule seeks to increase the magnitude of $w_{17}$ so that the input $s_1 = +1$ leads
to $s_7 = +1$, as can be seen in the matrix on the right. Because hidden units are only
weakly correlated (or anticorrelated), the weights linking hidden units are changed 
only slightly. 

In learning a training set of many patterns, each pattern is presented in turn, and 
the weights updated as just described. Learning ends when the actual output matches 
the desired output for all patterns (cf. Sect. 7.3.4).

Figure 7.8: The fully connected seven-unit network at the left is being trained via
the Boltzmann learning algorithm with the input pattern $s_1 = +1$, $s_2 = +1$, and the
output values $s_6 = -1$ and $s_7 = +1$, representing categories $\omega_1$ and $\omega_2$, respectively.
All $2^5 = 32$ configurations with $s_1 = +1$, $s_2 = +1$ are shown at the right, along
with their energy (Eq. 1). The black curve shows the energy before training; the
red curve shows the energy after training. Note particularly that after training all
configurations that represent the full training pattern have been lowered in energy,
i.e., have become more probable. Consequently, patterns that do not represent the
training pattern become less probable after training. Thus, after training, if the input
pattern $s_1 = +1$, $s_2 = +1$ is presented and the remaining network annealed, there is
an increased chance of yielding $s_6 = -1$, $s_7 = +1$, as desired.
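
The single-pattern update of Eq. 14 for the seven-unit example can be checked by brute force, since only 32 (learning-phase: 8) configurations are involved. The sketch below is illustrative only: the random initial weights, $\eta = 0.1$ and $T = 1$ are arbitrary choices of ours, the energy is assumed to be $E = -\frac{1}{2}\sum_{ij} w_{ij} s_i s_j$, and the expectations are computed exactly rather than by annealing. Units $s_1, \ldots, s_7$ are stored at indices 0 to 6.

```python
import itertools
import math
import random

N = 7                       # units s_1..s_7 live at indices 0..6
INPUTS = {0: +1, 1: +1}     # s_1 = s_2 = +1
OUTPUTS = {5: -1, 6: +1}    # s_6 = -1, s_7 = +1

def energy(w, s):
    # Assumed form of Eq. 1: E = -1/2 * sum_ij w_ij s_i s_j
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(N) for j in range(N))

def clamped_correlations(w, T, clamp):
    """Exact E[s_i s_j] with the units in `clamp` held fixed and the rest free."""
    free = [i for i in range(N) if i not in clamp]
    corr = [[0.0] * N for _ in range(N)]
    z = 0.0
    for values in itertools.product((-1, +1), repeat=len(free)):
        s = [clamp.get(i, 0) for i in range(N)]
        for i, v in zip(free, values):
            s[i] = v
        p = math.exp(-energy(w, s) / T)
        z += p
        for i in range(N):
            for j in range(N):
                corr[i][j] += p * s[i] * s[j]
    return [[c / z for c in row] for row in corr]

def weight_change(w, eta, T):
    """Delta w_ij of Eq. 14: inputs clamped in both phases, outputs only when learning."""
    learn = clamped_correlations(w, T, {**INPUTS, **OUTPUTS})
    unlearn = clamped_correlations(w, T, INPUTS)
    return [[0.0 if i == j else (eta / T) * (learn[i][j] - unlearn[i][j])
             for j in range(N)] for i in range(N)]

rng = random.Random(0)      # arbitrary seed, for repeatability only
w = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        w[i][j] = w[j][i] = rng.uniform(-0.5, 0.5)

dw = weight_change(w, eta=0.1, T=1.0)
```

Here `dw[0][1]` vanishes because both inputs are clamped in both phases, whereas `dw[0][6]` is positive and `dw[0][5]` negative, which pushes the free-running outputs toward the desired values.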


7.3.2 Missing features and category constraints 


A key benefit of Boltzmann training (including its preferred implementation, described 
in Sect. 7.3.3, below) is its ability to deal with missing features, both during training 
and during classification. If a deficient binary pattern is used for training, input units 
corresponding to missing features are allowed to vary — they are temporarily treated 
as (unclamped) hidden units rather than clamped input units. As a result, during 
annealing such units assume values most consistent with the rest of the input pattern 
and the current state of the network (Problem 14). Likewise, when a deficient pattern 
is to be classified, any units corresponding to missing input features are not clamped, 
and are allowed to assume any value. 

Some subsidiary knowledge or constraints can be incorporated into a Boltzmann 
network during classification. Suppose in a five-category problem it is somehow known 
that a test pattern is neither in category $\omega_1$ nor $\omega_4$. (Such constraints could come
from context or stages subsequent to the classifier itself.) During classification, then,
the output units corresponding to $\omega_1$ and $\omega_4$ are clamped at $s_i = -1$ during the
anneal, and the final category read as usual. Of course in this example the possible
categories are then limited to the unclamped output units, for $\omega_2$, $\omega_3$ and $\omega_5$. Such
constraint imposition may lead to an improved classification rate (Problem 15).

Figure 7.9: Boltzmann learning of a single pattern is illustrated for the seven-node
network of Fig. 7.8. The (symmetric) matrix on the left shows the correlation of units
for the learning component, where the input units are clamped to $s_1 = +1$, $s_2 = +1$,
and the outputs to $s_6 = -1$, $s_7 = +1$. The middle matrix shows the unlearning
component, where the inputs are clamped but outputs are free to vary. The difference
between those matrices is shown on the right, and is proportional to the weight update
(Eq. 14). Notice, for instance, that because the correlation between $s_1$ and $s_2$ is
large in both the learning and unlearning components (because those variables are
clamped), there is no associated weight change, i.e., $\Delta w_{12} = 0$. However, strong
correlations between $s_1$ and $s_7$ in the learning but not in the unlearning component
imply that the weight $w_{17}$ should be increased, as can be seen in the weight update
matrix.


Pattern completion 


The problem of pattern completion is to estimate the full pattern given just a part 
of that pattern; as such, it is related to the problem of classification with missing 
features. Pattern completion is naturally addressed in Boltzmann networks. A fully 
interconnected network, with or without hidden units, is trained with a set of repre- 
sentative patterns; as before, the visible units correspond to the feature components. 
When a deficient pattern is presented, a subset of the visible units are clamped to 
the components of a partial pattern, and the network annealed. The estimate of the 
unknown features appears on the remaining visible units, as illustrated in Fig. 7.10 
(Computer exercise 3). Such pattern completion in Boltzmann networks can be more 
accurate when known category information is imposed at the output units. 


Figure 7.10: A Boltzmann network can be used for pattern completion, i.e., filling in
unknown features of a deficient pattern. Here, a twelve-unit network with five hidden
units has been trained with the 10 numeral patterns of a seven-segment digital display.
The diagram at the lower left shows the correspondence between the display segments
and nodes of the network; a black segment is represented by a +1 and a light gray
segment as a $-1$. Consider the deficient pattern consisting of $s_2 = -1$, $s_5 = +1$. If
these units are clamped and the full network annealed, the remaining five visible units
will assume values most probable given the clamped ones, as shown at the right.


Boltzmann networks without hidden or category units are related to so-called
Hopfield networks or Hopfield auto-association networks (Problem 12). Such networks
store patterns but not their category labels. The learning rule for such networks does
not require the full Boltzmann learning of Eq. 14. Instead, weights are set to be
proportional to the correlation of the feature vectors, averaged over the training set,


$$w_{ij} \propto \mathcal{E}_Q[s_i s_j], \tag{15}$$


with $w_{ii} = 0$; further, there is no need to consider temperature. Such learning is of
course much faster than true Boltzmann learning using annealing. If a network fully
trained by Eq. 15 is nevertheless annealed, as in full Boltzmann learning, there is no
guarantee that the equilibrium correlations in the learning and unlearning phases are
equal, i.e., that $\Delta w_{ij} = 0$ (Problem 13).
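
The correlation rule of Eq. 15 is simple enough to sketch directly. The code below sets the weights from the training-set average of $s_i s_j$ and then completes a deficient pattern by repeatedly setting each unknown unit to the sign of its net input while the known units stay clamped. This zero-temperature update, the starting value of +1 for unknown units, and the function names are our assumptions for illustration; the text above says only that temperature need not be considered.

```python
def hopfield_weights(patterns):
    """w_ij proportional to the training-set average of s_i s_j, with w_ii = 0 (Eq. 15)."""
    d = len(patterns[0])
    n = len(patterns)
    return [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / n
             for j in range(d)] for i in range(d)]

def complete(w, partial, sweeps=10):
    """Fill in unknown features (marked None) by repeated sign updates.

    Known features stay clamped; unknown ones start at +1 (an arbitrary choice).
    """
    s = [+1 if v is None else v for v in partial]
    unknown = [i for i, v in enumerate(partial) if v is None]
    for _ in range(sweeps):
        changed = False
        for i in unknown:
            net = sum(w[i][j] * s[j] for j in range(len(s)))
            new = +1 if net >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:          # a fixed point has been reached
            break
    return s

# Two mutually orthogonal eight-feature patterns (made up for this example).
p1 = [+1, +1, +1, +1, -1, -1, -1, -1]
p2 = [+1, -1, +1, -1, +1, -1, +1, -1]
w = hopfield_weights([p1, p2])
restored = complete(w, [+1, +1, +1, +1, -1, None, None, None])
```

With these two stored patterns, clamping the first five features of `p1` recovers the remaining three.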

The successes of such Hopfield networks in true pattern recognition have been 
modest, partly because the basic Hopfield network does not have as natural an output 
representation for categorization problems. Occasionally, though, they can be used in
simple low-dimensional pattern completion or auto-association problems. One of their
primary drawbacks is their limited capacity, analogous to the fact that a two-layer
network cannot implement arbitrary decision boundaries as can a three-layer net. In
particular, it has been shown that the number of d-dimensional random patterns that
can be stored is roughly 0.14d, which is very limited indeed. In a Boltzmann network with hidden
units such as we have discussed, however, the number of hidden units can be increased 
in order to allow more patterns to be stored. 

Because Boltzmann networks include loops and feedback connections, the internal 
representations learned at the hidden units are often difficult to interpret. Occasion- 
ally, though, the pattern of weights from the input units suggests feature groupings 
that are important for the classification task.


7.3.3 Deterministic Boltzmann learning


The computational complexity of stochastic Boltzmann learning in a network with 
hidden units is very high. Each pattern must be presented several times, and every 
anneal requires each unit to be polled several times. Just as mean-field annealing is 
usually preferable to stochastic annealing, so too a mean-field version of Boltzmann 
learning is preferable to the stochastic version. The basic approach in deterministic 
Boltzmann learning is to use Eq. 14 with mean-field annealing and analog values for
the $s_i$. Recall, at the end of deterministic simulated annealing, the values of $s_i$ are
$\pm 1$, as required by the problem.

Specifically, if we let D be the set of training patterns x containing feature and 
category information, the algorithm is: 


Algorithm 3 (Deterministic Boltzmann learning) 


 1 begin initialize $\mathcal{D}$, $\eta$, $T(k)$, $w_{ij}$, $i, j = 1, \ldots, N$
 2   do Randomly select training pattern $\mathbf{x}$
 3     Randomize states $s_i$
 4     Anneal network with input and output clamped
 5     At final, low $T$, calculate $[s_i s_j]_{\alpha^i\alpha^o\ \mathrm{clamped}}$
 6     Randomize states $s_i$
 7     Anneal network with input clamped but output free
 8     At final, low $T$, calculate $[s_i s_j]_{\alpha^i\ \mathrm{clamped}}$
 9     $w_{ij} \leftarrow w_{ij} + (\eta/T)\,\big[[s_i s_j]_{\alpha^i\alpha^o\ \mathrm{clamped}} - [s_i s_j]_{\alpha^i\ \mathrm{clamped}}\big]$
10   until $k = k_{max}$ or convergence criterion met
11   return $w_{ij}$
12 end

Using mean-field theory, it is possible to efficiently calculate approximations of the
mean of correlations entering the gradient. The analog state $s_i$ of each unit replaces
its average value $\mathcal{E}[s_i]$ and could in theory be calculated by iteratively solving a set
of nonlinear equations. The mean of correlations is then calculated by making the
approximation $\mathcal{E}[s_i s_j] \approx \mathcal{E}[s_i]\,\mathcal{E}[s_j] \approx s_i s_j$, as shown in lines 5 & 8.
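
Algorithm 3 might be sketched in Python as follows. This is a rough illustration under several assumptions of ours: the mean-field update is taken to be $s_i \leftarrow \tanh\big(\frac{1}{T}\sum_j w_{ij} s_j\big)$, "randomize states" is read as small random analog values, a fixed number of sweeps is made at each temperature, the loop runs for a fixed number of pattern presentations instead of testing convergence, and the names and data layout are hypothetical.

```python
import math
import random

def mean_field_anneal(w, clamp, temps, rng, sweeps=5):
    """Return analog states after mean-field annealing; `clamp` maps index -> +/-1."""
    n = len(w)
    s = [clamp.get(i, rng.uniform(-0.1, 0.1)) for i in range(n)]   # randomize free units
    free = [i for i in range(n) if i not in clamp]
    for T in temps:
        for _ in range(sweeps):
            for i in free:
                net = sum(w[i][j] * s[j] for j in range(n))
                s[i] = math.tanh(net / T)          # assumed mean-field update
    return s

def deterministic_boltzmann(data, n_units, eta, temps, epochs, seed=0):
    """data: list of (input_clamp, output_clamp) dicts mapping unit index -> +/-1."""
    rng = random.Random(seed)
    w = [[0.0] * n_units for _ in range(n_units)]
    for i in range(n_units):
        for j in range(i + 1, n_units):
            w[i][j] = w[j][i] = rng.uniform(-0.1, 0.1)
    T_final = temps[-1]
    for _ in range(epochs):
        inputs, outputs = rng.choice(data)                               # line 2
        s_learn = mean_field_anneal(w, {**inputs, **outputs}, temps, rng)  # lines 3-5
        s_free = mean_field_anneal(w, inputs, temps, rng)                # lines 6-8
        for i in range(n_units):
            for j in range(i + 1, n_units):
                # line 9, with E[s_i s_j] approximated by the product s_i * s_j
                dw = (eta / T_final) * (s_learn[i] * s_learn[j] - s_free[i] * s_free[j])
                w[i][j] += dw
                w[j][i] = w[i][j]
    return w
```

A call such as `deterministic_boltzmann([({0: +1, 1: +1}, {3: +1})], n_units=4, eta=0.01, temps=[2.0 * 0.8 ** k for k in range(15)], epochs=20)` trains a four-unit network on one pattern; the weight between the two clamped inputs never changes, just as in the stochastic case.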


7.3.4 Initialization and setting parameters 


As with virtually every classifier, there are several interrelated parameters that must 
be set in a Boltzmann network. The first are the network topology and number of 
hidden units. The number of visible units (input and output) is determined by the 
dimensions of the binary feature vectors and number of categories. In the absence 
of detailed information about the problem, we assume the network is fully intercon- 
nected, and thus merely the number of hidden units must be set. A popular alternate 
topology is obtained by eliminating interconnections among input units, as well as 
among output units. (Such a network is faster to train but will be somewhat less 
effective at pattern completion or classifying deficient patterns.) Of course, generally 
speaking the harder the classification problem the more hidden units will be needed. 
The question is then, how many hidden units should be used? 

Suppose the training set $\mathcal{D}$ has n distinct patterns of input-output pairs. An
upper bound on the minimum number of hidden units is n, one for each pattern,
where for each pattern there is a corresponding unique hidden unit having value
$s_i = +1$ while all others are $-1$. This internal representation can be ensured in the
following way: for the particular hidden unit i, set $w_{ij}$ to be positive for each input
unit j corresponding to a +1 feature in its associated pattern; further set $w_{ij}$ to be
negative for input units corresponding to a $-1$ feature. For the remaining hidden
units, the sign of the corresponding weights should be inverted. Next, the connection 
from hidden unit i to the output unit corresponding to the known category should be 
positive, and negative to all other output units. The resulting internal representation 
is closely related to that in the probabilistic neural network implementation of Parzen 
windows (Chap. ??). Naturally, this representation is undesirable as the number of
weights grows rapidly with the number of patterns (quadratically in a fully interconnected network). Training becomes slow;
furthermore generalization tends to be poor. 

Since the states of the hidden units are binary valued, and since it takes $\lceil \log_2 n \rceil$
bits to specify n different items, at least $\lceil \log_2 n \rceil$ hidden units are needed if there
is to be a distinct hidden configuration for each of the n patterns. This gives a lower
bound of $\lceil \log_2 n \rceil$ on the number of hidden units.
Nevertheless, this bound need not be tight, as
there may be no set of weights insuring a unique representation (Problem 16). Aside 
from these bounds, it is hard to make firm statements about the number of hidden 
units needed — this number depends upon the inherent difficulty of the classification 
problem. It is traditional, then, to start with a somewhat large net and use weight 
decay. Much as we saw in backpropagation (Chap. ??), a Boltzmann network with 
“too many” hidden units and weights can be improved by means of weight decay. 
During training, a small increment $\epsilon$ is added to $w_{ij}$ when $s_i$ and $s_j$ are both positive
or both negative during the learning phase, but subtracted in the unlearning phase. It is
traditional to decrease $\epsilon$ throughout training. Such a version of weight decay tends
to reduce the effects on the weights due to spurious random correlations in units and 
to eliminate unneeded weights, thereby improving generalization. 
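
The two bounds discussed above are easy to tabulate. The helper below is a small sketch (the function name is ours): it returns the lower bound $\lceil \log_2 n \rceil$ and the value n, which bounds from above the minimum number of hidden units needed for n distinct patterns. Integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def hidden_unit_bounds(n_patterns):
    """Return (lower, upper) bounds on the number of hidden units for n distinct patterns."""
    if n_patterns < 1:
        raise ValueError("need at least one pattern")
    lower = (n_patterns - 1).bit_length()   # equals ceil(log2(n)) for n >= 1
    upper = n_patterns                      # one dedicated hidden unit per pattern
    return lower, upper
```

For the ten numeral patterns of a seven-segment display, for example, `hidden_unit_bounds(10)` gives between 4 and 10 hidden units.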

One of the benefits of Boltzmann networks over backpropagation networks is that 
“too many” hidden units in a backpropagation network tend to degrade performance 
more than “too many” in a Boltzmann network. This is because during learning, there
is stochastic averaging over states in a Boltzmann network which tends to smooth 
decision boundaries; backpropagation networks have no such equivalent averaging. 
Of course, this averaging comes at a higher computational burden for Boltzmann 
networks. 

The next matter to consider is weight initialization. Initializing all weights to 
zero is acceptable, but leads to unnecessarily slow learning. In the absence of infor- 
mation otherwise, we can expect that roughly half the weights will be positive and 
half negative. In a network with fully interconnected hidden units there is nothing 
to differentiate the individual hidden units; thus we can arbitrarily initialize roughly 
half of the weights to have positive values and the rest negative. Learning speed is 
increased if weights are initialized with random values within a proper range. Assume 
a fully interconnected network having N units (and thus $N - 1 \approx N$ connections to
each unit). Assume further that at any instant each unit has an equal chance of being
in state $s_i = +1$ or $s_i = -1$. We seek initial weights that will make the net force
on each unit a random variable with variance 1.0, roughly the useful range shown in
Fig. 7.5. This implies weights should be initialized randomly throughout the range
$-\sqrt{3/N} < w_{ij} < +\sqrt{3/N}$ (Problem 17).
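
The initialization range follows from a short variance calculation: a weight uniform on $(-a, +a)$ has variance $a^2/3$, so a sum of roughly N independent terms $w_{ij} s_j$ with $s_j = \pm 1$ has variance $N a^2/3$, which equals 1 when $a = \sqrt{3/N}$. A minimal sketch, assuming symmetric weights with zero diagonal (function names are ours):

```python
import math
import random

def init_weights(n_units, seed=None):
    """Symmetric weights drawn uniformly from (-sqrt(3/N), +sqrt(3/N)), zero diagonal."""
    rng = random.Random(seed)
    a = math.sqrt(3.0 / n_units)
    w = [[0.0] * n_units for _ in range(n_units)]
    for i in range(n_units):
        for j in range(i + 1, n_units):
            w[i][j] = w[j][i] = rng.uniform(-a, a)
    return w

def net_force_variance(n_units):
    """Variance of sum_j w_ij s_j: (N - 1) terms, each with variance a^2 / 3 = 1 / N."""
    a = math.sqrt(3.0 / n_units)
    return (n_units - 1) * a * a / 3.0
```

For N = 50 the exact variance is 49/50, close to the target of 1.0, as the approximation $N - 1 \approx N$ suggests.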

As mentioned, annealing schedules of the form T(k +1) = cT(k) for 0 < c < 1 are 
generally used, with 0.8 < c < 0.99. 
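
Such a geometric schedule can be generated in a few lines. The sketch below simply applies $T(k+1) = cT(k)$ a fixed number of times; the function name and the choice to return the whole list of temperatures are ours.

```python
def geometric_schedule(T1, c, k_max):
    """Temperatures T(1), ..., T(k_max) with T(k + 1) = c * T(k)."""
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie strictly between 0 and 1")
    temps = [T1]
    for _ in range(k_max - 1):
        temps.append(c * temps[-1])
    return temps
```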

If a very large number of iterations (several thousand) are needed, even
c = 0.99 may be too small. In that case we can write $c = e^{-1/k_0}$, and thus
$T(k) = T(1)e^{-k/k_0}$, and $k_0$ can be interpreted as a decay constant. The initial
temperature $T(1)$ should be set high enough that virtually all candidate state transitions
are accepted. While this condition can be ensured by choosing $T(1)$ extremely high,
in order to reduce training time we seek the lowest adequate value of $T(1)$. A lower
bound on the acceptable initial temperature depends upon the problem, but can be
set empirically by monitoring state transitions in short simulations at candidate
temperatures. Let $m_1$ be the number of energy-decreasing transitions (these are always
accepted), and $m_2$ the number of energy-increasing queries according to the annealing
algorithm; let $\mathcal{E}_+[\Delta E]$ denote the average increase in energy over such transitions.
Then, from Eq. 4 we find that the acceptance ratio is


$$R = \frac{\text{number of accepted transitions}}{\text{number of proposed transitions}} \approx \frac{m_1 + m_2 \exp\left[-\mathcal{E}_+[\Delta E]/T(1)\right]}{m_1 + m_2}. \tag{16}$$


Rearranging terms we see that the initial temperature obeys 


$$T(1) = \frac{\mathcal{E}_+[\Delta E]}{\ln[m_2] - \ln[m_2 R - m_1(1 - R)]}. \tag{17}$$


For any initial temperature set by the designer, the acceptance ratio may or may
not be nearly the desired 1.0; nevertheless Eq. 17 will be obeyed. The appropriate
value for $T(1)$ is found through a simple iterative procedure. First, set $T(1)$ to zero
and perform a sequence of $m_0$ trials (pollings of units); count empirically the number
of energetically favorable ($m_1$) and energetically unfavorable ($m_2$) transitions. In
general, $m_1 + m_2 < m_0$ because many candidate energy-increasing transitions are
rejected, according to Eq. 4. Next, use Eq. 17 to calculate a new, improved value of
$T(1)$ from the observed $m_1$ and $m_2$. Perform another sequence of $m_0$ trials, observe
new values for $m_1$ and $m_2$, recalculate $T(1)$, and so on. Repeat this procedure until
$m_1 + m_2 \approx m_0$. The associated $T(1)$ gives an acceptance ratio $R \approx 1$, and is thus to
be used. In practice this method quickly yields a good starting temperature.
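
Equations 16 and 17 are inverses of one another, which is easy to confirm numerically. The sketch below implements only the two formulas, not the polling simulation that supplies $m_1$, $m_2$ and $\mathcal{E}_+[\Delta E]$; the function names and the sample counts in the usage note are hypothetical.

```python
import math

def acceptance_ratio(m1, m2, mean_increase, T1):
    """Eq. 16: expected fraction of proposed transitions that are accepted."""
    return (m1 + m2 * math.exp(-mean_increase / T1)) / (m1 + m2)

def initial_temperature(m1, m2, mean_increase, R):
    """Eq. 17: temperature giving acceptance ratio R for the observed counts."""
    arg = m2 * R - m1 * (1.0 - R)
    if arg <= 0.0 or arg >= m2:
        # Eq. 17 is finite and positive only for m1 / (m1 + m2) < R < 1
        raise ValueError("R must satisfy m1/(m1+m2) < R < 1")
    return mean_increase / (math.log(m2) - math.log(arg))
```

For instance, with made-up counts $m_1 = 30$, $m_2 = 70$ and an average energy increase of 2.0, `initial_temperature(30, 70, 2.0, 0.95)` returns a temperature that, substituted back into `acceptance_ratio`, reproduces R = 0.95. Note that R = 1 exactly corresponds to an infinite temperature, so in practice a target slightly below 1 is requested.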

The next important parameter is the learning rate $\eta$ in Eq. 14. Recall that the
learning is based on gradient descent in the weighted Kullback-Leibler divergence
between the actual and the desired distributions on the visible units. In Chap. ??
we derived bounds on the learning rate for multilayer neural networks by calculating
the curvature of the error, and finding the maximum value of the learning rate that
ensured stability. This curvature was based on a Hessian matrix, the matrix of
second-order derivatives of the error with respect to the weights. In the case of an N-unit,
fully connected Boltzmann network, whose $N(N - 1)/2$ weights are described by a
vector $\mathbf{w}$, this curvature is proportional to $\mathbf{w}^t \mathbf{H} \mathbf{w}$, where


$$\mathbf{H} = \frac{\partial^2 D_{KL}}{\partial \mathbf{w}^2} \tag{18}$$


is the appropriate Hessian matrix and the Kullback-Leibler divergence is given by
Eq. 12. Given weak assumptions about the classification problem we can estimate this
Hessian matrix; the stability requirement is then simply $\eta < T^2/N^2$ (Problem 18).
Note that at large temperature T, a large learning rate is acceptable since the effective
error surface is smoothed by high randomness.

While not technically parameter setting, one heuristic that provides modest
computational speedup is to propose changing the states of several nodes simultaneously
early in an anneal. The change in energy and acceptance probability are calculated
as before. At the end of annealing, however, polling should be of single units in order
to accurately find the optimum configuration.

A method which occasionally improves the final solution is to update and store the
current best configuration during an anneal. If the basic annealing converges to a local
minimum that is worse than this stored configuration, the stored configuration should be
used instead. This is a variant of the pocket algorithm, which finds broad use in methods that
do not converge monotonically or can get caught in local minima (Chap. ??).
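
The idea of keeping the best configuration "in the pocket" during an anneal can be sketched as follows. The code is illustrative only: it assumes the energy $E = -\frac{1}{2}\sum_{ij} w_{ij} s_i s_j$, single-unit polling, and acceptance of an energy increase $\Delta E$ with probability $e^{-\Delta E/T}$; the function names and the fixed number of polls per temperature are ours.

```python
import math
import random

def energy(w, s):
    # Assumed energy: E = -1/2 * sum_ij w_ij s_i s_j
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def anneal_with_pocket(w, temps, polls_per_temp, seed=0):
    """Stochastic anneal that also remembers the lowest-energy state visited."""
    rng = random.Random(seed)
    n = len(w)
    s = [rng.choice((-1, +1)) for _ in range(n)]
    e = energy(w, s)
    best_s, best_e = list(s), e
    for T in temps:
        for _ in range(polls_per_temp):
            i = rng.randrange(n)
            s[i] = -s[i]                      # propose flipping one unit
            e_new = energy(w, s)
            if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
                e = e_new                     # accept the candidate state
                if e < best_e:
                    best_s, best_e = list(s), e
            else:
                s[i] = -s[i]                  # reject: undo the flip
    if best_e < e:                            # final state is worse than the stored one
        s, e = best_s, best_e
    return s, e
```

The function returns the stored configuration whenever the anneal ends in a state of higher energy, so the reported energy is never worse than the best one encountered.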

There are two stopping criteria associated with Boltzmann learning. The first 
determines when to stop a single anneal (associated with either the learning or the 
unlearning components). Here, the final temperature should be so low that no ener- 
getically unfavorable transitions are accepted. Such information is readily apparent 
in the graph of the energy versus iteration number, such as shown at the right of 
Fig. 7.3. All N variables should be polled individually at the end of the anneal, to 
insure that the final configuration is indeed a local (though perhaps not global) energy 
minimum. 

The second stopping criterion controls the number of times each training pattern 
is presented to the network. Of course the proper criterion depends upon the inherent 
difficulty of the classification problem. In general, overtraining is less of a concern 
in Boltzmann networks than it is in multilayer neural networks trained via gradient 
descent. This is because the averaging over states in Boltzmann networks tends to 
smooth decision boundaries while overtraining in multilayer neural networks tunes 
the decision boundaries to the particular training set. A reasonable stopping criterion 
for Boltzmann networks is to monitor the error on a validation set (Chap. ??), and 
stop learning when this error no longer changes significantly. 
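
One simple way to express "no longer changes significantly" is to look at the spread of the most recent validation errors. The sketch below is one possible reading, not a prescribed rule; the function name, the `patience` window and the tolerance `tol` are assumptions of ours.

```python
def should_stop(validation_errors, patience=3, tol=1e-3):
    """True when the validation error varied by less than `tol` over the last `patience` steps."""
    if len(validation_errors) <= patience:
        return False                           # too little history to judge
    recent = validation_errors[-(patience + 1):]
    return max(recent) - min(recent) < tol
```

The list `validation_errors` would be extended by one entry after each pass through the training set, and learning halted the first time the function returns true.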


7.4 *Boltzmann networks and graphical models 


While we have considered fully interconnected Boltzmann networks, the learning al- 
gorithm (Algorithm 3) applies equally well to networks with arbitrary connection 
topologies. Furthermore, it is easy to modify Boltzmann learning in order to impose 
constraints such as weight sharing. As a consequence, several popular recognition 
architectures — so-called graphical models such as Bayesian belief networks and Hid- 
den Markov Models — have counterparts in structured Boltzmann networks, and this 
leads to new methods for training them. 

Recall from Chap. ?? that Hidden Markov Models consist of several discrete hidden 
and visible states; at each discrete time step t, the system is in a single hidden state
and emits a single visible state, denoted $\omega(t)$ and $v(t)$, respectively. The transition
probabilities between hidden states at successive time steps are


$$a_{ij} = P(\omega_j(t + 1)|\omega_i(t)) \tag{19}$$


and between hidden and visible states at a given time are


$$b_{jk} = P(v_k(t)|\omega_j(t)). \tag{20}$$


The Forward-Backward or Baum-Welch algorithm (Chap. ??, Algorithm ??) is
traditionally used for learning these parameters from a pattern of $T_f$ visible states*
$\mathbf{V}^{T_f} = \{v(1), v(2), \ldots, v(T_f)\}$.


* Here we use $T_f$ to count the number of discrete time steps in order to avoid confusion with the
temperature $T$ in Boltzmann simulations.

Figure 7.11: A Hidden Markov Model can be “unfolded” in time to show a trellis,
which can be represented as a Boltzmann chain, as shown. The discrete hidden
states are grouped into vertical sets, fully interconnected by weights $A_{ij}$ (related to
the HMM transition probabilities $a_{ij}$). The discrete visible states are grouped into
horizontal sets, and are fully interconnected with the hidden states by weights $B_{jk}$
(related to transition probabilities $b_{jk}$). Training the net with a single pattern, or list
of $T_f$ visible states, consists of clamping the visible states and performing Boltzmann
learning throughout the full network, with the constraint that each of the time shifted
weights labeled by a particular $A_{ij}$ have the same numerical value.


Recall that a Hidden Markov Model can be “unfolded” in time to yield a trellis
(Chap. ??, Fig. ??). A structured Boltzmann network with the same trellis topology,
called a Boltzmann chain, can be used to implement the same classification as the
corresponding Hidden Markov Model (Fig. 7.11). Although it is often simpler to
work in a representation where discrete states have multiple values, we temporarily
work in a representation where the binary nodes take value $s_i = 0$ or $+1$, rather than
$\pm 1$ as in previous discussions. In this representation, a special case of the general
energy (Eq. 1) includes terms for a particular sequence of visible, $\mathbf{V}^{T_f}$, and hidden
states $\boldsymbol{\omega}^{T_f} = \{\omega(1), \omega(2), \ldots, \omega(T_f)\}$ and can be written as


$$E_{\omega V} = E[\boldsymbol{\omega}^{T_f}, \mathbf{V}^{T_f}] = -\sum_{t=1}^{T_f - 1} A_{ij} - \sum_{t=1}^{T_f} B_{jk} \tag{21}$$


where the particular values of the $A_{ij}$ and $B_{jk}$ terms depend implicitly upon the sequence.
The choice of binary state representation implies that only the weights linking nodes
that both have $s_i = +1$ appear in the energy. Each “legal” configuration, consisting
of a single visible unit and a single hidden unit at each time, implies a set of $A_{ij}$
and $B_{jk}$ (Problem 20). The partition function is the sum over all legal states,


$$Z = \sum_{\omega V} e^{-E_{\omega V}/T}, \tag{22}$$


which insures normalization. The correspondence between the Boltzmann chain at
temperature T and the unfolded Hidden Markov model (trellis) implies


$$A_{ij} = T \ln a_{ij} \quad \text{and} \quad B_{jk} = T \ln b_{jk}. \tag{23}$$


(As in our discussion of Hidden Markov Models, we assume the initial hidden state is 
known and thus there is no need to consider the correspondence of prior probabilities in 
the two approaches.) While the 0-1 binary representation of states in the structured
network clarifies the relationship to Hidden Markov Models through Eq. 21, the more
familiar representation $s_i = \pm 1$ works as well. Weights in the structured Boltzmann
network are trained according to the method of Sect. 7.3, though the relation to 
transition probabilities in a Hidden Markov Model is no longer simple (Problem 21). 
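
The correspondence of Eq. 23 can be verified numerically in the 0-1 representation: for any legal configuration, the Boltzmann factor $e^{-E_{\omega V}/T}$ computed from Eq. 21 equals the product of the Hidden Markov Model transition and emission probabilities along the same path. The sketch below uses a made-up two-state model in its usage note; the function names and the convention that `hidden[t]` and `visible[t]` hold state indices are ours.

```python
import math

def chain_weights(a, b, T):
    """Eq. 23: A_ij = T ln a_ij and B_jk = T ln b_jk (all probabilities must be > 0)."""
    A = [[T * math.log(p) for p in row] for row in a]
    B = [[T * math.log(p) for p in row] for row in b]
    return A, B

def chain_energy(A, B, hidden, visible):
    """Eq. 21 for one legal configuration; hidden[t] and visible[t] are state indices."""
    e = 0.0
    for t in range(len(hidden) - 1):
        e -= A[hidden[t]][hidden[t + 1]]     # hidden-to-hidden weights
    for t in range(len(hidden)):
        e -= B[hidden[t]][visible[t]]        # hidden-to-visible weights
    return e

def hmm_path_probability(a, b, hidden, visible):
    """Product of HMM transition and emission probabilities along the same path."""
    p = 1.0
    for t in range(len(hidden) - 1):
        p *= a[hidden[t]][hidden[t + 1]]
    for t in range(len(hidden)):
        p *= b[hidden[t]][visible[t]]
    return p
```

With, say, `a = [[0.7, 0.3], [0.4, 0.6]]`, `b = [[0.9, 0.1], [0.2, 0.8]]` and any temperature T, `math.exp(-chain_energy(A, B, hidden, visible) / T)` agrees with `hmm_path_probability(a, b, hidden, visible)` to within rounding.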


Other graphical models 


In addition to Hidden Markov Models, a number of graphical models have analogs 
in structured Boltzmann networks. One of the most general includes Bayesian belief 
nets, directed acyclic graphs in which each node can be in one of a number of discrete 
states, and nodes are interconnected with conditional probabilities (Chap. ??). As 
in the case of Hidden Markov Models, the correspondence with Boltzmann networks 
is clearest if the discrete states in the belief net are binary states; nevertheless in 
practice multistate representations more naturally enforce the constraints and are 
generally preferred (Computer exercise ??). 

A particularly intriguing recognition problem arises when a temporal signal has 
two inherent time scales, for instance the rapid daily behavior in a financial market 
superimposed on slow seasonal variations. A standard Hidden Markov Model typically 
has a single inherent time scale and hence is poorly suited to such problems. We might 
seek to use two interconnected HMMs, possibly with different numbers of hidden 
states. Alas, the Forward-Backward algorithm generally does not converge when 
applied to a model having closed loops, as when two Hidden Markov Models have 
cross connections. 

Here the correspondence with Boltzmann networks is particularly helpful. We can 
link two Boltzmann chains with cross connections, as shown in Fig. 7.12, to form 
a Boltzmann zipper. The particular benefit of such an architecture is that it can 
learn both short-time structure (through the “fast” component chain) as well as long- 
time structure (through the “slow” chain). The cross connections, labeled by weight 
matrix E in the figure, learn correlations between the “fast” and “slow” internal 
representations. Unlike the case in Eq. 23, the E weights are not simply related to 
transition probabilities, however (Problem ??). 

Boltzmann zippers can address problems such as acoustic speech recognition, 
where the fast chain learns the rapid transitions and structure of individual phonemes 
while the slow component chain learns larger structure associated with prosody and 
stress throughout a word or a full phrase. Related applications include speechreading 
(lipreading), where the fast chain learns the acoustic transitions and the slow chain 
the much slower transitions associated with the (visible) image of the talker’s lips, 
jaw and tongue; and body gestures, where fast hand motions are coupled to slower
large-scale motions of the arms and torso. 


[Figure 7.12 diagram omitted; its panel labels are "fast" visible units and "slow" visible units.]

Figure 7.12: A Boltzmann zipper consists of two Boltzmann chains (cf. Fig. 7.11),
whose hidden units are interconnected. The component chains differ in the rate
at which visible features are sampled, and thus they capture structure at different
temporal scales. Correlations are learned by the weights linking the hidden units,
here labeled E. It is somewhat more difficult to train linked Hidden Markov Models
to learn structure at different time scales.


7.5 *Evolutionary methods


Inspired by the process of biological evolution, evolutionary methods of classifier design
employ stochastic search for an optimal classifier. These admit a natural implementation
on massively parallel computers. In broad overview, such methods proceed
as follows. First, we create several classifiers — a population — each varying somewhat 
from the other. Next, we judge or score each classifier on a representative version of 
the classification task, such as accuracy on a set of labeled examples. In keeping with 
the analogy with biological evolution, the resulting (scalar) score is sometimes called 
the fitness. Then we rank these classifiers according to their score and retain the best 
classifiers, some portion of the total population. Again, in keeping with biological 
terminology, this is called survival of the fittest. 

We now stochastically alter the classifiers to produce the next generation — the 
children or offspring. Some offspring classifiers will have higher scores than their 
parents in the previous generation, some will have lower scores. The overall process 
is then repeated for subsequent generations: the classifiers are scored, the best ones
retained, randomly altered to give yet another generation, and so on. In part because 
of the ranking, each generation has, on average, a slightly higher score than the 
previous one. The process is halted when the single best classifier in a generation has 
a score that exceeds a desired criterion value. 

The method employs stochastic variations, and these in turn depend upon the 
fundamental representation of each classifier. There are two primary representations 
we shall consider: a string of binary bits (in basic genetic algorithms), and snippets 
of computer code (in genetic programming). In both cases, a key property is that
occasionally very large changes in classifier are introduced. The presence of such
large changes and random variations implies that evolutionary methods can find good 
classifiers even in extremely complex discontinuous spaces or “fitness landscapes” that 
are hard to address by techniques such as gradient descent. 


7.5.1 Genetic Algorithms 


In basic genetic algorithms, the fundamental representation of each classifier is a bi- 
nary string, called a chromosome. The mapping from the chromosome to the features 
and other aspects of the classifier depends upon the problem domain, and the designer 
has great latitude in specifying this mapping. In pattern classification, the score is 
usually chosen to be some monotonic function of the accuracy on a data set, possibly 
with a penalty term to avoid overfitting. We use a desired fitness, $\theta$, as the stopping 
criterion. Before we discuss these points in more depth, we first consider more specif- 
ically the structure of the basic genetic algorithm, and then turn to the key notion of 
genetic operators, used in the algorithm. 


Algorithm 4 (Basic Genetic algorithm) 


 1 begin initialize $\theta$, $P_{co}$, $P_{mut}$, L N-bit chromosomes
 2    do Determine fitness of each chromosome, $f_i$, i = 1,..., L
 3       Rank the chromosomes
 4       do Select two chromosomes with highest score
 5          if Rand[0, 1) < $P_{co}$ then crossover the pair at a randomly chosen bit
 6          else change each bit with probability $P_{mut}$
 7          Remove the parent chromosomes
 8       until N offspring have been created
 9    until Any chromosome's score f exceeds $\theta$
10 return Highest fitness chromosome (best classifier)
11 end


Figure 7.13 shows schematically the evolution of a population of classifiers given by 
Algorithm 4. 
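
The loop of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the name `basic_ga`, the toy fitness `onemax`, the parameter defaults, the retention of the top half of the population as parents, the carrying over of the single best parent, and the cap `max_generations` are all choices made for the example and do not come from the text.

```python
import random

def basic_ga(fitness, n_bits, pop_size, theta, p_co=0.7, p_mut=0.02,
             max_generations=500, seed=0):
    """Sketch of Algorithm 4; returns (best chromosome, its fitness)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(max_generations):         # assumed cap so the sketch halts
        pop.sort(key=fitness, reverse=True)  # lines 2-3: score and rank
        if fitness(pop[0]) > theta:          # line 9: stopping criterion
            break
        survivors = pop[:max(2, pop_size // 2)]   # assumed: keep the top half
        offspring = []
        while len(offspring) < pop_size - 1:
            # line 4: take parents pairwise in descending order of score
            for i in range(0, len(survivors) - 1, 2):
                a, b = survivors[i], survivors[i + 1]
                if rng.random() < p_co:      # line 5: one-point crossover
                    cut = rng.randrange(1, n_bits)
                    a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
                else:                        # line 6: flip each bit w.p. p_mut
                    a = [bit ^ (rng.random() < p_mut) for bit in a]
                    b = [bit ^ (rng.random() < p_mut) for bit in b]
                offspring += [a, b]
        # Assumed extra: carry the best parent over so the top score never drops.
        pop = [pop[0]] + offspring[:pop_size - 1]
    best = max(pop, key=fitness)
    return best, fitness(best)

def onemax(chromosome):
    """Toy fitness (an assumption): fraction of bits equal to 1."""
    return sum(chromosome) / len(chromosome)

best, score = basic_ga(onemax, n_bits=16, pop_size=20, theta=0.95)
```

In a real classifier design, `fitness` would decode the chromosome into a classifier and measure its accuracy on labeled data; the toy fitness merely exercises the loop.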


Genetic operators 


There are three primary genetic operators that govern reproduction, i.e., producing 
offspring in the next generation described in lines 5 & 6 of Algorithm 4. The last two 
of these introduce variation into the chromosomes (Fig. 7.14): 


Replication: A chromosome is merely reproduced, unchanged. 


Crossover: Crossover involves the mixing — “mating” — of two chromosomes. A 
split point is chosen randomly along the length of either chromosome. The first 
part of chromosome A is spliced to the last part of chromosome B, and vice 
versa, thereby yielding two new chromosomes. The probability a given pair of 
chromosomes will undergo crossover is given by $P_{co}$ in Algorithm 4. 


Mutation: Each bit in a single chromosome is given a small chance, $P_{mut}$, of being 
changed from a 1 to a 0 or vice versa. 
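
The three operators can be written as small functions on lists of bits. The sketch below is illustrative only; the function names and the convention that the split point lies strictly inside the chromosome are assumptions.

```python
import random

def replicate(chrom):
    """Replication: return an unchanged copy."""
    return list(chrom)

def crossover(a, b, rng=random):
    """One-point crossover: splice the front of each parent to the back of the other."""
    cut = rng.randrange(1, len(a))    # assumed: split point strictly inside
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, p_mut, rng=random):
    """Mutation: flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in chrom]

parent_a, parent_b = [1] * 8, [0] * 8
child_1, child_2 = crossover(parent_a, parent_b, random.Random(1))
```

Because the two parents here are all ones and all zeros, the children make the splice visible: `child_1` is a run of ones followed by zeros, and `child_2` is its bitwise complement.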


Figure 7.13: A basic genetic algorithm is a stochastic iterative search method. Each 
of the L classifiers in the population in generation k is represented by a string of 
bits of length N, called a chromosome (on the left). Each classifier is judged or 
scored according to its performance on a classification task, giving L scalar values $f_i$. 
The chromosomes are then ranked according to these scores. The chromosomes are 
considered in descending order of score, and operated upon by the genetic operators 
of replication, crossover and mutation to form the next generation of chromosomes, 
the offspring. The cycle repeats until a classifier exceeds the criterion score $\theta$. 


Other genetic operators may be employed, for instance inversion — where the chromo- 
some is reversed front to back. This operator is used only rarely since inverting a 
chromosome with a high score nearly always leads to one with very low score. Below 
we shall briefly consider another operator, insertions. 


Representation 


When designing a classifier by means of genetic algorithms we must specify the mapping 
from a chromosome to properties of the classifier itself. Such a mapping will depend 
upon the form of classifier and problem domain, of course. One of the earliest and 
simplest approaches is to let the bits specify features (such as pixels in a character 
recognition problem) in a two-layer Perceptron with fixed weights (Chap. ??). The 
primary benefit of this particular mapping is that different segments of the chromo- 
some, which generally remain undisturbed under the crossover operator, may evolve 
to recognize different portions of the input space such as the descender (lower) or the 
ascender (upper) portions of typed characters. As a result, occasionally the crossover 
operation will append a good segment for the ascender region in one chromosome 
to a good segment for the descender region in another, thereby yielding an excellent 
overall classifier. 


Another mapping is to let different segments of the chromosome represent the 
weights in a multilayer neural net with a fixed topology. Likewise, a chromosome 
could represent a network topology itself, the presence of an individual bit implying 
two particular neurons are interconnected. One of the most natural representations 
is for the bits to specify properties of a decision tree classifier (Chap. ??), as shown 
in Fig. 7.15. 
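
As a concrete illustration of such a mapping, the sketch below decodes one nine-bit node of the tree described in the caption of Fig. 7.15: one sign bit, two feature bits and six threshold bits. The bit conventions (0 for a plus sign, feature bits read as a 0-based index, most significant threshold bit first) and the function names are assumptions, since the caption does not fix them.

```python
def decode_node(bits):
    """Decode nine bits into (sign, feature_index, threshold)."""
    if len(bits) != 9:
        raise ValueError("each node is encoded by exactly nine bits")
    sign = +1 if bits[0] == 0 else -1        # assumed: 0 means '+', 1 means '-'
    feature = 2 * bits[1] + bits[2]          # 0-based: 0 is x_1, ..., 3 is x_4
    threshold = int("".join(str(b) for b in bits[3:]), 2)  # six bits, 0..63
    return sign, feature, threshold

def node_query(bits, x):
    """Answer the node's query: is (sign * x_i) < threshold?"""
    sign, feature, threshold = decode_node(bits)
    return sign * x[feature] < threshold

# Under the assumed conventions these bits encode the rule +x_3 < 41.
node_bits = [0, 1, 0, 1, 0, 1, 0, 0, 1]
decoded = decode_node(node_bits)
```

A full tree would concatenate one such nine-bit segment per node, so crossover at a segment boundary swaps whole queries between trees.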
Figure 7.14: Three basic genetic operations are used to transform a population of 
chromosomes at one generation to form a new generation. In replication, the chromo- 
some is unchanged. Crossover involves the mixing or “mating” of two chromosomes 
to yield two new chromosomes. A position along the chromosomes is chosen randomly 
(red vertical line); then the first part of chromosome A is linked with the last part of 
chromosome B, and vice versa. In mutation, each bit is given a small chance of being 
changed from a 1 to a 0 or vice versa. 


Scoring 


For a c-category classification problem, it is generally most convenient to evolve c 
dichotomizers, each to distinguish a different $\omega_i$ from all other $\omega_j$ for $j \neq i$. During 
classification, the test pattern is presented to each of the c dichotomizers and assigned 
the label accordingly. The goal of classifier design is accuracy on future patterns, or if 
decisions have associated costs, then low expected cost. Such goals should be reflected 
in the method of scoring and selection in a genetic algorithm. Given sample patterns 
representative of the target classification task, it is natural to base the score on 
the classification accuracy measured on the data set. As we have seen numerous times, 
there is a danger that the classifier becomes “tuned” to the properties of the particular 
data set, however. (We can informally broaden our usage of the term “overfitting” 
from generic learning to apply to this search-based case as well.) One method for 
avoiding such overfitting is penalizing classifier complexity, and thus the score should 
have a term that penalizes overly large networks. Another method is to adjust 
the stopping criterion. Since the appropriate measure of classifier complexity and 
the stopping criterion depend strongly on the problem, it is hard to make specific 
guidelines in setting these parameters. Nevertheless, designers should be prepared to 
explore these parameters in any practical application. 
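
One simple way to combine accuracy with a complexity penalty is sketched below. The linear form of the penalty, the weight `penalty`, and the idea of passing the complexity in as a number (for instance a count of nodes or weights) are assumptions made for illustration; the text leaves the measure of complexity to the designer.

```python
def penalized_score(predict, patterns, labels, complexity, penalty=0.01):
    """Score = accuracy on the data set minus penalty * complexity."""
    correct = sum(predict(x) == y for x, y in zip(patterns, labels))
    accuracy = correct / len(labels)
    return accuracy - penalty * complexity

# Toy dichotomizer (hypothetical): says "in the category" when the feature is positive.
toy_patterns = [(1.0,), (-1.0,), (2.0,), (-3.0,)]
toy_labels = [True, False, True, True]
toy_score = penalized_score(lambda x: x[0] > 0, toy_patterns, toy_labels,
                            complexity=5)
```

Here three of the four patterns are classified correctly, so the accuracy of 0.75 is reduced by 0.01 × 5 to a score of about 0.70.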


Selection 


The process of selection specifies which chromosomes from one generation will be 
sources for chromosomes in the next generation. Up to here, we have assumed that 
the chromosomes would be ranked and selected in order of decreasing fitness until the 
next generation is complete. This has the benefit of generally pushing the population 
toward higher and higher scores. Nevertheless, the average improvement from one 
generation to the next depends upon the variance in the scores at a given generation, 
and because this standard fitness-based selection need not give high variance, other 
selection methods may prove superior. 


Figure 7.15: One natural mapping is from a binary chromosome to a binary tree 
classifier, illustrated here for a four-feature, monothetic tree dichotomizer. In this 
example, each of the nodes computes a query of the form $\pm x_i < \theta$? and is governed 
by nine bits in the chromosome. The first bit specifies a sign, the next two bits 
specify the feature queried. The remaining six bits are a binary representation of the 
threshold $\theta$. For instance, the left-most node encodes the rule $+x_3 < 41$? (In practice, 
larger trees would be used for problems with four features.) 


The principal alternative selection scheme is fitness-proportional selection, or 
fitness-proportional reproduction, in which the probability that each chromosome is selected 
is proportional to its fitness. While high-fitness chromosomes are preferentially 
selected, occasionally low-fitness chromosomes are selected, and this may preserve 
diversity and increase variance of the population. 
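
Fitness-proportional selection is often implemented as a weighted random draw. The following is a minimal sketch that assumes non-negative fitness values and sampling with replacement; the function name is hypothetical.

```python
import random

def fitness_proportional_select(population, fitnesses, k, rng=random):
    """Draw k chromosomes, each with probability proportional to its fitness."""
    if min(fitnesses) < 0 or sum(fitnesses) <= 0:
        raise ValueError("fitnesses must be non-negative with a positive sum")
    return rng.choices(population, weights=fitnesses, k=k)

# With all the weight on one chromosome, only that chromosome can be drawn.
picks = fitness_proportional_select(["A", "B", "C"], [0.0, 0.0, 1.0], k=5,
                                    rng=random.Random(0))
```

With less extreme fitness values, low-fitness chromosomes are still drawn now and then, which is the source of the preserved diversity mentioned above.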


A minor modification of this method is to make the probability of selection proportional 
to some monotonically increasing function of the fitness. If the function 
has a positive second derivative, the probability that high-fitness chromosomes are 
selected is enhanced. One version of this heuristic is inspired by the Boltzmann factor of Eq. 2; 
the probability that chromosome i with fitness $f_i$ will be selected is 

$$P(i) = \frac{e^{f_i/T}}{\mathcal{E}\left[e^{f_i/T}\right]}, \qquad (24)$$

where the expectation is over the current generation and T is a control parameter 
loosely referred to as a temperature. Early in the evolution the temperature is set 
high, giving all chromosomes roughly equal probability of being selected. Late in the 
evolution the temperature is set lower so as to find the chromosomes in the region of 
the optimal classifier. We can express such search by analogy to biology: early in the 
search the population remains diverse and explores the fitness landscape in search of 
promising areas; later the population exploits the specific fitness opportunities in a 
small region of the space of possible classifiers. 
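
The effect of the temperature can be seen numerically with the short sketch below. It normalizes the Boltzmann weights by their sum rather than by their mean, so the values form a probability distribution; this differs from Eq. 24 only by the constant factor L. The function name and the example fitness values are assumptions.

```python
import math

def boltzmann_selection_probs(fitnesses, T):
    """Selection probabilities proportional to exp(f_i / T)."""
    f_max = max(fitnesses)
    # Subtracting f_max leaves the ratios unchanged and avoids overflow at low T.
    weights = [math.exp((f - f_max) / T) for f in fitnesses]
    total = sum(weights)
    return [w / total for w in weights]

fits = [0.2, 0.5, 0.9]
hot = boltzmann_selection_probs(fits, T=1e6)    # nearly uniform
cold = boltzmann_selection_probs(fits, T=0.01)  # nearly all mass on the best
```

At high temperature the three chromosomes are chosen almost equally often, while at low temperature the fittest one is chosen almost exclusively.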


7.5.2 Further heuristics 


There are many additional heuristics that can occasionally be of use. One concerns 
the adaptation of the crossover and mutation rates, $P_{co}$ and $P_{mut}$. If these rates are 
too low, the average improvement from one generation to the next will be small, and 
the search unacceptably long. Conversely, if these rates are too high, the evolution 
is undirected and similar to a highly inefficient random search. We can monitor the 
average improvement in fitness of each generation and increase the mutation and crossover 
rates as long as such improvement is rapid. In practice, this is done by encoding the 
rates in the chromosomes themselves and allowing the genetic algorithm to select the 
proper values. 

Another heuristic is to use ternary, or n-ary, chromosomes rather than the traditional 
binary ones. These representations provide little or no benefit at the algorithmic 
level, but may make the mapping to the classifier itself more natural and easier 
to compute. For instance, a ternary chromosome might be most appropriate if the 
classifier is a decision tree with three-way splits. 

Occasionally the mapping to the classifier will work for chromosomes of differ- 
ent length. For example, if the bits in the chromosome specify weights in a neural 
network, then longer chromosomes would describe networks with a larger number of 
hidden units. In such a case we allow the insertion operator, which with a small 
probability inserts bits into the chromosome at a randomly chosen position. This 
so-called “messy” genetic algorithm method has a more appropriate counterpart in 
genetic programming, as we shall see in Sect. 7.6. 
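
A minimal sketch of an insertion operator for variable-length chromosomes follows. The function name, the number of inserted bits `n_new`, and the choice to draw the new bits uniformly at random are assumptions; the text specifies only that bits are inserted at a randomly chosen position with small probability.

```python
import random

def insert_bits(chrom, p_insert, n_new=1, rng=random):
    """With probability p_insert, insert n_new random bits at a random position."""
    if rng.random() >= p_insert:
        return list(chrom)                       # unchanged copy
    pos = rng.randrange(len(chrom) + 1)          # any gap, including both ends
    new_bits = [rng.randint(0, 1) for _ in range(n_new)]
    return list(chrom[:pos]) + new_bits + list(chrom[pos:])

original = [1, 0, 1, 1]
longer = insert_bits(original, p_insert=1.0, n_new=2, rng=random.Random(0))
```

The mapping from chromosome to classifier must accept the longer string, for example by interpreting the extra bits as the weights of an additional hidden unit.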


7.5.3 Why do they work? 


Because there are many heuristics to choose as well as parameters to set, it is hard to 
make firm theoretical statements about building classifiers by means of evolutionary 
methods. The performance and search time depend upon the number of bits, the size 
of a population, the mutation and crossover rates, choice of features and mapping 
from chromosomes to the classifier itself, the inherent difficulty of the problem and 
possibly parameters associated with other heuristics. 

A genetic algorithm restricted to mere replication and mutation is, at base, a 
version of stochastic random search. The incorporation of the crossover operator, 
which mates two chromosomes, provides a qualitatively different search, one that 
has no counterpart in stochastic grammars (Chap. ??). Crossover works by finding, 
rewarding and recombining “good” segments of chromosomes, and the more faithfully 
the segments of the chromosomes represent such functional building blocks, the better 
we can expect genetic algorithms to perform. The only way to ensure this is with prior 
knowledge of the problem domain and the desired form of classifier. 


7.6 *Genetic Programming 


Genetic programming shares the algorithmic structure of basic genetic algorithms, 
but differs in the representation of each classifier. Instead of chromosomes 
consisting of strings of bits, genetic programming uses snippets of computer programs 
made up of mathematical operators and variables. As a result, the genetic operators 
are somewhat different; moreover a new operator plays a significant role in genetic 
programming. 

The four principal operators in genetic programming are (Fig. 7.16): 


Replication: A snippet is merely reproduced, unchanged. 


Crossover: Crossover involves the mixing — “mating” — of two snippets. A split 
point is chosen from allowable locations in snippet A as well as from snippet B. 
The first part of snippet A is spliced to the back part of snippet B, and 
vice versa, thereby yielding two new snippets. 


Mutation: Each element in a single snippet is given a small chance of being changed to 
a different value. Such a change must be compatible with the syntax of the 
total snippet. For instance, a number can be replaced by another number; a 
mathematical operator that takes a single argument can be replaced by another 
such operator, and so forth. 


Insertion: Insertion consists in replacing a single element in the snippet with another 
(short) snippet randomly chosen from a set. 


In the c-category problem, it is simplest to form c dichotomizers just as in genetic 
algorithms. If the output of the classifier is positive, the test pattern belongs to 
category $\omega_i$; if negative, then it is NOT in $\omega_i$. 


Representation 


A program must be expressed in some language, and the choice affects the complexity 
of the procedure. Syntactically rich languages such as C or C++ are complex and 
somewhat difficult to work with. Here the syntactic simplicity of a language such as Lisp 
is advantageous. Many Lisp expressions can be written in the form (<operator> 
<operand> <operand>), where an <operand> can be a constant, a variable or another 
parenthesized expression. For example, (+ X 2) and (* 3 (+ Y 5)) are valid Lisp 
expressions for the arithmetic expressions $x + 2$ and $3(y + 5)$, respectively. These 
expressions are easily represented by a binary tree, with the operator being specified 
at the node and the operands appearing as the children (Fig. 7.17). 
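
The tree view of such expressions can be mimicked in Python with nested tuples of the form `(operator, operand, operand)`. This is a sketch for illustration, not a Lisp interpreter: the tuple encoding, the three supported operators and the helper names are assumptions.

```python
import operator

# Assumed operator set for the sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(expr, env):
    """Evaluate a nested-tuple expression given variable bindings in env."""
    if isinstance(expr, tuple):              # internal node: (op, left, right)
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(expr, str):                # leaf: variable name
        return env[expr]
    return expr                              # leaf: numeric constant

def in_category(expr, env):
    """Implied sign function: positive output means the pattern is in the category."""
    return evaluate(expr, env) > 0

expr_1 = ("+", "X", 2)                # (+ X 2)
expr_2 = ("*", 3, ("+", "Y", 5))      # (* 3 (+ Y 5))
```

For example, with X bound to 4 the first expression evaluates to 6, and with Y bound to 1 the second evaluates to 18.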

Whatever language is used, genetic programming operators used for mutation 
should replace variables and constants with variables and constants, and operators 
with functionally compatible operators. They should also be required to produce 
syntactically valid results. Nevertheless, occasionally an ungrammatical code snippet 
may be produced. For that reason, it is traditional to employ a wrapper, a routine 
that decides whether the classifier is meaningful, and eliminates it if not. 
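
The type-preserving mutation and the wrapper can be sketched as follows, with snippets held as nested tuples `(operator, operand, operand)`. Everything named here is hypothetical: the operator and variable sets, the rule that a constant mutates by plus or minus one, and the function names are assumptions made for the example.

```python
import random

BINARY_OPS = ("+", "-", "*")   # assumed set of interchangeable operators
VARIABLES = ("X", "Y")         # assumed set of feature names

def is_valid(expr):
    """Wrapper check: True only for a well-formed expression tree."""
    if isinstance(expr, tuple):
        return (len(expr) == 3 and expr[0] in BINARY_OPS
                and is_valid(expr[1]) and is_valid(expr[2]))
    if isinstance(expr, str):
        return expr in VARIABLES
    return isinstance(expr, (int, float)) and not isinstance(expr, bool)

def mutate(expr, p_mut, rng=random):
    """Change each element with probability p_mut, keeping its type."""
    if isinstance(expr, tuple):
        op, left, right = expr
        if rng.random() < p_mut:
            op = rng.choice(BINARY_OPS)          # operator replaces operator
        return (op, mutate(left, p_mut, rng), mutate(right, p_mut, rng))
    if rng.random() >= p_mut:
        return expr
    if isinstance(expr, str):
        return rng.choice(VARIABLES)             # variable replaces variable
    return expr + rng.choice((-1, 1))            # constant replaces constant

def filter_population(population):
    """Eliminate snippets that the wrapper judges not meaningful."""
    return [e for e in population if is_valid(e)]

parent = ("*", 3, ("+", "Y", 5))
child = mutate(parent, p_mut=1.0, rng=random.Random(0))
kept = filter_population([parent, ("+", "X"), ("sqrt", "X", 1), "Z"])
```

Because replacements are drawn only from elements of the same type, `mutate` never produces an invalid tree here; the wrapper matters for operators such as crossover or insertion, or for snippets that arrive malformed, as in the last three entries above.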
Figure 7.16: Four basic genetic operations are used to transform a population of 
snippets of code at one generation to form a new generation. In replication, the 
snippet is unchanged. Crossover involves the mixing or “mating” of two snippets to 
yield two new snippets. A position along the snippet A is randomly chosen from the 
allowable locations (red vertical line); likewise one is chosen for snippet B. Then the 
front portion of A is spliced to the back portion of B and vice versa. In mutation, 
each element is given a small chance of being changed. There are several different 
types of elements, and replacements must be of the same type. For instance, only a 
number can replace another number; only a numerical operator that takes a single 
argument can replace a similar operator, and so on. In insertion, a randomly selected 
element is replaced by a compatible snippet, keeping the entire snippet grammatically 
well formed and meaningful. 


It is nearly impossible to make sound theoretical statements about genetic programming, 
and even the rules of thumb learned from simulations in one domain, such 
as control or function optimization, are of little value in another domain, such as 
classification problems. Of course, the method works best in problems that are matched 
by the classifier representation, that is, by simple operations such as multiplication, division, 
square roots, logical NOT, and so on. 

Nevertheless, we can state that as computation continues to decrease in cost, more 
of the burden of solving classification problems will be assumed by computation rather 
than careful analysis, and here techniques such as evolutionary ones will be of use in 
classification research. 


Summary 


When a pattern recognition problem involves a model that is discrete or of such 
high complexity that analytic or gradient descent methods are unlikely to work, we 
may employ stochastic techniques — ones that at some level rely on randomness to 
find model parameters. Simulated annealing, based on physical annealing of metals, 
consists in randomly perturbing the system, and gradually decreasing the randomness 
to a low final level, in order to find an optimal solution. Boltzmann learning trains the 
weights in a network so that the probability of a desired final output is increased. Such 
learning is based on gradient descent in the Kullback-Leibler divergence between two 
distributions of visible states at the output units: one distribution describes these 
units when clamped at the known category information, and the other when they 
are free to assume values based on the activations throughout the network. Some 
graphical models, such as hidden Markov models and Bayes belief networks, have 
counterparts in structured Boltzmann networks, and this leads to new applications of 
Boltzmann learning. 


Figure 7.17: Unlike the decision trees of Fig. 7.15 and Chap. ??, the trees shown here 
are merely a representation using the syntax of Lisp that implements a single function. 
For instance, the upper-right (parent) tree implements one such function of the features. 
Such functions are used with an implied threshold or sign function when used for 
classification. Thus the function will operate on the features of a test pattern and 
emit category $\omega_i$ if the function is positive, and NOT $\omega_i$ otherwise. 


Search methods based on evolution — genetic algorithms and genetic programming 
— perform highly parallel stochastic searches in a space set by the designer. The fun- 
damental representation used in genetic algorithms is a string of bits, or chromosome; 
the representation in genetic programming is a snippet of computer code. Variation 
is introduced by means of crossover, mutation and insertion. As with all classification 
methods, the better the features, the better the solution. There are many heuristics 
that can be employed and parameters that must be set. As the cost of computation 
continues to decline, computationally intensive methods, such as Boltzmann networks 
and evolutionary methods, should become increasingly popular. 


Bibliographical and Historical Remarks 


The general problem of search is of central interest in computer science and artificial 
intelligence, and is far too expansive to treat here. Nevertheless, techniques such as 
depth first, breadth first, branch-and-bound and A* [19] occasionally find use in fields 
touching upon pattern recognition, and practitioners should have at least a passing 
knowledge of them. Good overviews can be found in [33] and a number of textbooks 
on artificial intelligence, such as [46, 67, 55]. For rigor and completeness, Knuth’s 
book on the subject is without peer [32]. 

The infinite monkey theorem, attributed to Sir Arthur Eddington, states that if 
there is a sufficiently large number of monkeys typing at typewriters, eventually one 
will bang out the script to Hamlet. It reflects one extreme of the tradeoff between prior 
knowledge about the location of a solution on the one hand and the effort of search 
required to find it on the other. Computers made available in the early 1950s permitted 
the first automated attempts at highly stochastic search, most notably the pioneering 
work of Metropolis and colleagues for simulating chemical processes [40]. One of the 
earliest and most influential applications of stochastic methods for pattern recognition 
was the Pandemonium learning method due to Selfridge, which used stochastic search 
for input weights in a feed-forward network model [57]. Kirkpatrick, Gelatt and Vec- 
chi [30], and independently Cerny [64], introduced the Boltzmann factor to general 
stochastic search methods, the first example of simulated annealing. The statistical 
physics foundations of Boltzmann factors, at the present level of mathematical sophis- 
tication, can be found in [31]. The physical model of stochastic binary components 
was introduced by Wilhelm Lenz in 1920, but became associated with his doctoral 
student Ernst Ising several years thereafter, and first called the “Ising model” in a 
paper by R. Peierls [50]. It has spawned a great deal of theoretical and simulation 
research [20]. 

The use of simulated annealing for learning was proposed by Ackley, Hinton and 
Sejnowski [2]; a good book on the method is [1], which describes the procedure for 
initializing the temperature in simulated annealing and was the inspiration for Fig. 7.10. 
Peterson and Anderson introduced deterministic annealing and mean-field Boltzmann 
learning and described some of the (rare) conditions when the mean-field approxima- 
tion might lead to non-optimal solutions [51]. Hinton showed that the Boltzmann 
learning rule performs steepest descent in weight space for the deterministic algorithm 
[21]. 

A number of papers explore structured Boltzmann networks, including Hopfield’s 
influential paper on networks for pattern completion or auto-association [25]. The 
linear storage capacity of Hopfield networks quoted in the text, and $n \log n$ relationships 
for partial storage, are derived in [66, 39, 65]. The learning rule described in that work 
has roots in the Learning matrix of [59, 60]. Harmonium [58, 14], another two-layer 
variant of a Boltzmann network, is primarily of historical interest. The relation of 
Boltzmann networks to graphical models such as hidden Markov models has been 
explored in [27, 37] and [56], which was the source for our discussion in Sect. 7.4. 
Implementation of constraints for Boltzmann machines was introduced in [42] and a 
second-order pruning algorithm was described in [49]. 

Boltzmann learning has been applied to a number of real-world pattern recognition 
problems, most notably speech recognition [8, 52] and stochastic restoration of images 
or pattern completion [16]. Because Boltzmann learning has high computational 
burden yet a natural VLSI implementation, a number of special-purpose chips have 
been fabricated [23, 43, 44]. The ordering of configurations in Fig. 7.3, in which 
neighboring configurations differ in just one bit, is a version of a Gray code; an 
elegant method for constructing such codes is described in [18, Sect. 5.16 — 5.17]. 
Some of the earliest work inspired by evolution was described in [12, 13], but the 
computational power available was insufficient for anything but toy problems. Later, 
Rechenberg’s “evolution strategies” were applied to optimization in aeronautical de- 
sign problems [53]. His earliest work did not employ full populations of candidate 
solutions, nor the key operation of crossover. Evolutionary programming saves good 
parents while evolution strategies generally do not. Neither employs mating, i.e., 
crossover. Holland introduced genetic algorithms in 1975 [24], and like the algorithm 
itself, researchers have explored a very wide range of problems in search, optimization 
and pattern recognition. A review appears in [6], and there is an increasing number 
of textbooks [17, 41], the latter with a more rigorous approach to the mathemat- 
ics. Koza’s extensive books on Genetic Programming provide a good introduction, 
and include several illustrative simulations [34, 35], though relatively little on pattern 
recognition. There are several collections of papers on evolutionary techniques in pat- 
tern recognition, including [48]. An intriguing effect due to the interaction of learning 
and evolution is the Baldwin effect, where learning can influence the rate of evolution 
[22]; it has been shown that too much learning (as well as too little learning) leads to 
slower evolution [28]. Evolutionary methods can lead to “non-optimal” or inelegant 
solutions, and there is computational evidence that this occurs in nature [61, 62]. 


Problems 


Section 7.1 


1. One version of the infinite monkey theorem states that a single (immortal) monkey 
typing randomly will ultimately reproduce the script of Hamlet. Estimate the time 
needed for this, assuming the monkey can type two characters per second, that the 
play has 50 pages, each containing roughly 80 lines, and 40 characters per line. Assume 
there are 30 possible characters: a through z, space, period, exclamation point and 
carriage return. Compare this time to the estimated age of the universe, $10^{10}$ years. 


Section 7.2 


2. Prove that for any optimization problem of the form of Eq. 1 having a non- 
symmetric connection matrix, there is an equivalent optimization problem in which 
the matrix is replaced by its symmetric part. 

3. The complicated energy landscape in the left of Fig. 7.2 is misleading for a number 
of reasons. 


(a) Discuss the difference between the continuous space shown in that figure with 
the discrete space for the true optimization problem. 


(b) The figure shows a local minimum near the middle of the space. Given the 
nature of the discrete space, are any states closer to any “middle”? 


(c) Suppose the axes referred to continuous variables $s_i$ (as in mean-field annealing). 
If each $s_i$ obeyed a sigmoid (Fig. 7.5), could the energy landscape be non-monotonic, 
as is shown in Fig. 7.2? 


4. Consider exhaustive search for the minimum of the energy given in Eq. 1 for 
binary units and arbitrary connections $w_{ij}$. Suppose that on a uniprocessor it takes 
$10^{-8}$ seconds to calculate the energy for each configuration. How long will it take to 
exhaustively search the space for N = 100 units? How long for N = 1000 units? 

5. Suppose it takes a uniprocessor $10^{-10}$ seconds to perform a single multiply-accumulate, 
$w_{ij} s_i s_j$, in the calculation of the energy $E = -\frac{1}{2} \sum_{ij} w_{ij} s_i s_j$ given in 
Eq. 1. 


(a) Make some simplifying assumptions and write a formula for the total time re- 
quired to search exhaustively for the minimum energy in a fully connected net- 
work of N nodes. 


(b) Plot your function using a log-log scale for N = 1,...,10°. 
(c) What size network, N, could be searched exhaustively in a day? A year? A 
century? 


6. Make and justify any necessary mathematical assumptions and show analytically 
that at high temperature, every configuration in a network of N units interconnected 
by weights is equally likely (cf. Fig. 7.1). 

7. Derive the exponential form of the Boltzmann factor in the following way. Consider 
an isolated set of M + N independent magnets, each of which can be in an $s_i = +1$ 
or $s_i = -1$ state. There is a uniform magnetic field applied and this means that the 
$s_i = +1$ state has some positive energy, which we can arbitrarily set to 
1; the $s_i = -1$ state has energy $-1$. The total energy of the system is therefore the 
number pointing up, $k_u$, minus the number pointing down, $k_d$; that is, 
$E_T = k_u - k_d$. (Of course, $k_u + k_d = M + N$ regardless of the total energy.) 

The fundamental statistical assumptions describing this system are that the mag- 
nets are independent, and that the probability a subsystem (viz., the N magnets), 
has a particular energy is proportional to the number of configurations that have this 
energy. 


(a) Consider the subsystem of N magnets, which has energy $E_N$. Write an expression 
for the number of configurations $K(N, E_N)$ that have energy $E_N$. 


(b) As in part (a), write a general expression for the number of configurations in 
the subsystem of M magnets at energy $E_M$, i.e., $K(M, E_M)$. 


(c) Since the two subsystems consist of independent magnets, the total number of 
ways the full system can have total energy $E_T = E_N + E_M$ is the product 
$K(N, E_N) K(M, E_M)$. Write an analytic expression for this total number. 


(d) In statistical physics, if $M \gg N$, the M-magnet subsystem is called the heat 
reservoir or heat bath. Assume that $M \gg N$, and write a series expansion for 
your answer to part (c). 


(e) Use your answer in part (d) to show that the probability the N-unit system has 
energy $E_N$ has the form of a Boltzmann factor, $e^{-E_N}$. 


8. Prove that the analog value of $s_i$ given by Eq. 5 is the expected value of a binary 
variable at temperature T in the following simple case. Consider a single binary 
magnet whose $s = +1$ state has energy $+E_0$ and $s = -1$ state has energy $-E_0$, as 
would occur if an external magnetic field has been applied. 


(a) Construct the partition function Z by summing over the two possible states 
$\gamma = 0$ and $\gamma = 1$ according to Eq. 3. 


(b) Recall that the probability of finding the system in state $s = +1$ is given by a 
Boltzmann factor divided by the partition function (Eq. 2). Define the (analog) 
expected value of the state to be 

$$s = \mathcal{E}[s] = P(s = +1)(+1) + P(s = -1)(-1).$$

Show that this implies the analog state of a single magnet obeys Eq. 5. 


(c) Argue that if the N — 1 other magnets in a large system can be assumed to give an 
average field (this is the mean-field approximation), then the analog value of a single 
magnet will obey a function of the form given in Eq. 5. 

9. Consider Boltzmann networks applied to the exclusive-OR problem. 


(a) A fully connected network consisting solely of two input units and a single output 
unit, whose sign gives the class, cannot solve the exclusive-OR problem. Prove 
this by writing a set of inequalities for the weights and show that they are 
inconsistent. 


(b) As in part (a), prove that a fully connected Boltzmann network consisting solely 
of two input units and two output units representing the two categories cannot 
solve the exclusive-OR problem. 


(c) Prove that a Boltzmann network of part (b) with a single hidden unit can implement 
the exclusive-OR problem. 


10. Consider a fully-connected Boltzmann network with two input units, a single 
hidden unit and a single (category) output unit. Construct by hand a set of weights 
$w_{ij}$ for i, j = 1, 2, 3, 4 which allows the net to solve the exclusive-OR problem for a 
representation in which $s_i = \pm 1$. 


Section 7.3 


11. Show all intermediate steps in the derivation of Eq. 14 from Eq. 12. Be sure 
your notation distinguishes this case from that leading to Eq. 10. 

12. Train a six-unit Hopfield network with the following three patterns using the 
learning rule of Eq. 15. 


$\mathbf{x}^1 = \{+1, +1, +1, -1, -1, -1\}$ 
$\mathbf{x}^2 = \{+1, -1, +1, -1, +1, -1\}$ 
$\mathbf{x}^3 = \{-1, +1, -1, -1, +1, +1\}$ 


(a) Verify that each of the patterns gives a local minimum in energy by perturbing 
each of the six units individually and monitoring the energy. 


(b) Verify that the symmetric state $s_i \to -s_i$ for $i = 1, \ldots, 6$ also gives a local energy
minimum of the same energy. 


13. Repeat Problem 12 but with the eight-unit network and the following patterns: 


$x^1 = (+1, +1, +1, -1, -1, -1, -1, +1)$
$x^2 = (+1, -1, +1, +1, +1, -1, +1, -1)$
$x^3 = (-1, +1, -1, -1, +1, +1, -1, +1)$
14. Show that a missing feature assumes the appropriate value when training a
deficient pattern in a Boltzmann network.

15. Show how the constraint that a pattern is not in a set of categories improves the
recognition for the others.

16. The text states a lower bound on the number of hidden units needed in a
Boltzmann network trained with n patterns is $\lceil \log_2 n \rceil$. This is, of course, the number
of hiddens needed to ensure a distinct hidden representation for each pattern. Show
that this lower bound is not tight, as there may not be weights to ensure such a
representation. Do this by considering a Boltzmann network with three input units,
three hiddens and a single output, addressing the 3-bit parity problem.


(a) Argue that the hidden representation must be equivalent to the input represen- 
tation. 


(b) Argue that there is no two-layer Boltzmann network (here, hidden to output) 
that can solve the three-bit parity problem. Explain why this implies that the 
$\lceil \log_2 n \rceil$ bound is not tight.


17. Consider the problem of initializing the N weights in a fully connected Boltzmann
network. Let there be $N - 1 \approx N$ weights connected to each unit. Suppose too that
the chance that any particular unit will be in the $s_i = +1$ state is 0.5, and likewise
for the $s_i = -1$ state. We seek weights such that the variance of the net activation
of each unit is roughly 1.0, a reasonable measure of the end of the linear range of the
sigmoid nonlinearity. The variance of $l_i$ is

$$\mathrm{VAR}[l_i] = \sum_{j=1}^{N} \mathrm{VAR}[w_{ij} s_j] = N \, \mathrm{VAR}[w_{ij}] \, \mathrm{VAR}[s_j].$$

Set $\mathrm{VAR}[l_i] = 1$, and solve for $\mathrm{VAR}[w_{ij}]$ and thereby show that weights should be
initialized randomly in the range $-\sqrt{3/N} < w_{ij} < +\sqrt{3/N}$.
18. Show that under reasonable conditions, the learning rate $\eta$ in Eq. 14 for a
Boltzmann network of N units should be bounded $\eta \le T^2/N$ to ensure stability as
follows:


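(a) Take the derivative of Eq. 14 to prove that the Hessian is


$$H = \frac{\partial^2 D_{KL}}{\partial w^2} = \frac{\partial^2 D_{KL}}{\partial w_{ij} \, \partial w_{uv}} = \frac{1}{T^2} \left[ E[s_i s_j s_u s_v] - E[s_i s_j] E[s_u s_v] \right].$$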


(b) Use this to show that

$$w^t H w \le \frac{1}{T^2} E\left[ \left( \sum_{ij} |w_{ij}| \right)^2 \right].$$
(c) Suppose we normalize weights such that $\|w\| = 1$ and thus

$$\sum_{ij} w_{ij} \le \sqrt{N}.$$

Use this fact together with your answer to part (b) to show that the curvature
of the $D_{KL}$ obeys

$$w^t H w \le \frac{1}{T^2} E\left[ \left( \sqrt{N} \right)^2 \right] = \frac{N}{T^2}.$$
(d) Use the fact that stability demands the learning rate to be the inverse of the
curvature, along with your answer in (c), to show that the learning rate should
be bounded $\eta \le T^2/N$.


Section 7.4


19. For any HMM, there exists a Boltzmann chain that implements the equivalent
probability model. Show that the converse is not true, that is, that it is not the case
that for every chain there exists an equivalent HMM. Use the fact that weights in a
Boltzmann chain are bounded $-\infty < A_{ij}, B_j < +\infty$, but probabilities in an HMM are
positive and sum to 1.

20. For a Boltzmann chain with Tp steps, c hidden units and xx visible units, how 
many legal paths are there (cf. Fig. 7.11). 

21. The discussion of the relation between Boltzmann chains and hidden Markov 
models in the text assumed the initial hidden state was known. Show that if this 
hidden state is not known, the energy of Eq. 21 has another term which describes the 
prior probability the system is in a particular hidden state. 


Section 7.5


22. Consider the populations of size L of N-bit chromosomes.


(a) Show the number of different populations is $\binom{L + 2^N - 1}{2^N - 1}$.


(b) Assume some number $1 \le L_a < L$ are selected for reproduction in a given
generation. Use your answer to part (a) to write an expression for the number
of possible sets of parents as a function of L and $L_a$. (It is just the set, not their
order that is relevant.)


(c) Show that your answer to part (b) reduces to that in part (a) for the case
$L_a =$


(d) Show that your answer to part (b) gives L in the case $L_a = 1$.


Section 7.6


23. For each of the below snippets, mark suitable positions for breaks for the crossover 
operator. 


(a) (* (X0 (+ x4 x8)) x5 (SQRT 5))
(b) (SQRT (X0 (+ x4 x8)))
(c) (* (- (SIN X0) (* (TAN 3.4) (SQRT X4)))
(d) (* (X0 (+ x4 x8)) x5 (SQRT 5))


(e) Separate the following Lisp symbols into groups such that any member in a
group can be replaced by another through the mutation operator in genetic
programming:

{+, X3, NOR, *, X0, 5.5, SQRT, /, X5, SIN, -, -4.5, NOT, OR, 2.7, TAN}
Computer exercises


Several of the exercises use the data in the following table. 


$\omega_1$ $\omega_2$


XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 
XXXXX XXXXX 


Section 7.2


1. Consider the problem of searching for a global minimum of the energy given in 
Eq. 1 for a system of N units, fully interconnected by weights randomly chosen in the 
range $-1/\sqrt{N} < w_{ij} < +1/\sqrt{N}$. Let N = 10.


(a) Write a program to search through all $2^N$ configurations to find global minima,
and apply it to your network. Verify that there are two “global” minima. 


(b) Write a program to perform the following version of gradient descent. Let 
the units be numbered and ordered i = 1, ..., N for bookkeeping. For each
configuration, find the unit with the lowest index i which can be changed to 
lower the total energy. Iteratively make this change until the system converges, 
or it is clear that it will not converge. 


(c) Perform a search as in part (b) but with random polling of units. 
(d) Repeat parts (a) through (c) for N = 100 and N = 1000.


(e) Discuss your results, paying particular attention to convergence and the problem 
of local minima. 


2. Algorithm 1 
Section 7.3


3. Train a Boltzmann network consisting of eight input units and ten category units 
with the characters of a seven-segment display shown in Fig. 7.10. 


(a) Use the network to classify each of the ten patterns, and thus verify that all 
have been learned. 


(b) Explore pattern completion in your network the following way. For each of the 
$2^8$ possible patterns do pattern completion for several characters. Add hidden
units and show that better performance results for ambiguous characters.
4. ;laskjdf
Section 7.4


5. Iskjdf 
Section 7.5


6. lksdfj 
Section 7.6


7. Consider a two-category problem with four features in the bounded region $-1 < x_i < +1$
for i = 1, 2, 3, 4.


(a) Generate training points in each of two categories defined by 


$\omega_1$: $x_1 + 0.5 x_2 - 0.3 x_3 - 0.1 x_4 < 0.5$
$\omega_2$: $x_1 + 0.2 x_2 + x_3 - 0.6 x_4 < 0.2$


by randomly selecting a point in the four-dimensional space. If it satisfies neither
of the two inequalities, discard the point. If it satisfies just one of the inequalities,
label its category accordingly. If it satisfies both inequalities, randomly choose
a label with probability 0.5. Continue in this way until you have 50 points for each
category.


(b) GP 
Chapter 8


Non-metric Methods 


8.1 Introduction 


We have considered pattern recognition based on feature vectors of real-valued
and discrete-valued numbers, and in all cases there has been a natural measure
of distance between such vectors. For instance in the nearest-neighbor classifier the 
notion figures conspicuously — indeed it is the core of the technique — while for 
neural networks the notion of similarity appears when two input vectors sufficiently 
“close” lead to similar outputs. Most practical pattern recognition methods address 
problems of this sort, where feature vectors are real-valued and there exists some 
notion of metric. 

But suppose a classification problem involves nominal data — for instance descrip- 
tions that are discrete and without any natural notion of similarity or even ordering. 
Consider the use of information about teeth in the classification of fish and sea mam- 
mals. Some teeth are small and fine (as in baleen whales) for straining tiny prey from 
the sea. Others (as in sharks) come in multiple rows. Some sea creatures, such as
walruses, have tusks. Yet others, such as squid, lack teeth altogether. There is no 
clear notion of similarity (or metric) for this information about teeth: it is meaning- 
less to consider the teeth of a baleen whale any more similar to or different from the 
tusks of a walrus, than it is the distinctive rows of teeth in a shark from their absence 
in a squid, for example. 

Thus in this chapter our attention turns away from describing patterns by vectors
of real numbers and toward using lists of attributes. A common approach is
to specify the values of a fixed number of properties by a property d-tuple. For
example, consider describing a piece of fruit by the four properties of color, texture,
taste and size. Then a particular piece of fruit might be described by the 4-tuple
{red, shiny, sweet, small}, which is a shorthand for color = red, texture = shiny,
taste = sweet and size = small. Another common approach is to describe the pattern
by a variable length string of nominal attributes, such as a sequence of base pairs
in a segment of DNA, e.g., “AGCTTCAGATTCCA.”* Such lists or strings might
themselves be the output of other component classifiers of the type we have seen elsewhere.
For instance, we might train a neural network to recognize different component brush 


* We often put strings between quotation marks, particularly if this will help to avoid ambiguities. 
strokes used in Chinese and Japanese characters (roughly a dozen basic forms); a
classifier would then accept as inputs a list of these nominal attributes and make the 
final, full character classification. 

How can we best use such nominal data for classification? Most importantly, how 
can we efficiently learn categories using such non-metric data? If there is structure in 
strings, how can it be represented? In considering such problems, we move beyond the 
notion of continuous probability distributions and metrics toward discrete problems 
that are addressed by rule-based or syntactic pattern recognition methods. 


8.2 Decision trees 


It is natural and intuitive to classify a pattern through a sequence of questions, 
in which the next question asked depends on the answer to the current question. 
This “20-questions” approach is particularly useful for non-metric data, since all 
of the questions can be asked in a “yes/no” or “true/false” or “value(property) ∈
set_of_values” style that does not require any notion of metric.

Such a sequence of questions is displayed in a directed decision tree or simply tree, 
where by convention the first or root node is displayed at the top, connected by succes- 
sive (directional) links or branches to other nodes. These are similarly connected until 
we reach terminal or leaf nodes, which have no further links (Fig. 8.1). Sections 8.3 & 
8.4 describe some generic methods for creating such trees, but let us first understand 
how they are used for classification. The classification of a particular pattern begins 
at the root node, which asks for the value of a particular property of the pattern. The 
different links from the root node correspond to the different possible values. Based
on the answer we follow the appropriate link to a subsequent or descendent node. In 
the trees we shall discuss, the links must be mutually distinct and exhaustive, i.e., 
one and only one link will be followed. The next step is to make the decision at the 
appropriate subsequent node, which can be considered the root of a sub-tree. We 
continue this way until we reach a leaf node, which has no further question. Each leaf 
node bears a category label and the test pattern is assigned the category of the leaf 
node reached. 

The simple decision tree in Fig. 8.1 illustrates one benefit of trees over many other 
classifiers such as neural networks: interpretability. It is a straightforward matter 
to render the information in such a tree as logical expressions. Such interpretability 
has two manifestations. First, we can easily interpret the decision for any particular 
test pattern as the conjunction of decisions along the path to its corresponding leaf 
node. Thus if the properties are {taste, color, shape, size}, the pattern x = {sweet, 
yellow, thin, medium} is classified as Banana because it is (color = yellow) AND 
(shape = thin).* Second, we can occasionally get clear interpretations of the cate- 
gories themselves, by creating logical descriptions using conjunctions and disjunctions 
(Problem 8). For instance the tree shows Apple = (green AND medium) OR (red 
AND medium). 

Rules derived from trees — especially large trees — are often quite complicated 
and must be reduced to aid interpretation. For our example, one simple rule describes 
Apple = (medium AND NOT yellow). Another benefit of trees is that they lead to 


* We retain our convention of representing patterns in boldface even though they need not be true 
vectors, i.e., they might contain nominal data that cannot be added or multiplied the way vector 
components can. For this reason we use the term “attribute” to represent both nominal data and
real-valued data, and reserve “feature” for real-valued data. 
[Diagram for Fig. 8.1: a decision tree drawn on levels 0 through 3, with leaf nodes including Watermelon, Apple, Grape, Grapefruit, Lemon and Cherry.]


Figure 8.1: Classification in a basic decision tree proceeds from top to bottom. The 
questions asked at each node concern a particular property of the pattern, and the 
downward links correspond to the possible values. Successive nodes are visited until a 
terminal or leaf node is reached, where the category label is read. Note that the same 
question, Size?, appears in different places in the tree, and that different questions 
can have different numbers of branches. Moreover, different leaf nodes, shown in pink, 
can be labeled by the same category (e.g., Apple). 


rapid classification, employing a sequence of typically simple queries. Finally, we note 
that trees provide a natural way to incorporate prior knowledge from human experts. 
In practice, though, such expert knowledge is of greatest use when the classification
problem is fairly simple and the training set is small. 


8.3 CART


Now we turn to the matter of using training data to create or “grow” a decision tree. 
We assume that we have a set D of labeled training data and we have decided on a 
set of properties that can be used to discriminate patterns, but do not know how to 
organize the tests into a tree. Clearly, any decision tree will progressively split the 
set of training examples into smaller and smaller subsets. It would be ideal if all the 
samples in each subset had the same category label. In that case, we would say that 
each subset was pure, and could terminate that portion of the tree. Usually, however, 
there is a mixture of labels in each subset, and thus for each branch we will have 
to decide either to stop splitting and accept an imperfect decision, or instead select 
another property and grow the tree further. 

This suggests an obvious recursive tree-growing process: given the data repre- 
sented at a node, either declare that node to be a leaf (and state what category to 
assign to it), or find another property to use to split the data into subsets. However,
this is only one example of a more generic tree-growing methodology known as
CART (Classification and Regression Trees). CART provides a general framework
that can be instantiated in various ways to produce different decision trees. In the
CART approach, six general kinds of questions arise: 


1. Should the properties be restricted to binary-valued or allowed to be multi-valued?
That is, how many decision outcomes or splits will there be at a node?
2. Which property should be tested at a node? 
3. When should a node be declared a leaf? 


4. If the tree becomes “too large,” how can it be made smaller and simpler, i.e., 
pruned? 


5. If a leaf node is impure, how should the category label be assigned? 


6. How should missing data be handled? 


We consider each of these questions in turn. 


8.3.1 Number of splits 


Each decision outcome at a node is called a split, since it corresponds to splitting a 
subset of the training data. The root node splits the full training set; each successive 
decision splits a proper subset of the data. The number of splits at a node is closely 
related to question 2, specifying which particular split will be made at a node. In 
general, the number of splits is set by the designer, and could vary throughout the tree, 
as we saw in Fig. 8.1. The number of links descending from a node is sometimes called 
the node’s branching factor or branching ratio, denoted B. However, every decision 
(and hence every tree) can be represented using just binary decisions (Problem 2). 
Thus the root node querying fruit color (B = 3) in our example could be replaced by 
two nodes: the first would ask color = green?, and at the end of its “no” branch,
another node would ask color = yellow?. Because of the universal expressive power
of binary trees and the comparative simplicity in training, we shall concentrate on 
such trees (Fig. 8.2). 
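The replacement of a multi-way node by a chain of binary queries can be sketched as follows. The encodings are assumptions made for illustration: a multi-way node is a hypothetical (property, {value: subtree}) pair, a binary node is a (property, value, yes-subtree, no-subtree) tuple, and leaves are label strings.

```python
def binarize(node):
    """Rewrite a multi-way node as a chain of 'property = value?' binary nodes."""
    if isinstance(node, str):            # a leaf needs no rewriting
        return node
    prop, links = node
    values = list(links)
    # The last value needs no query of its own: it is whatever remains
    # at the end of the final "no" branch.
    result = binarize(links[values[-1]])
    for value in reversed(values[:-1]):
        result = (prop, value, binarize(links[value]), result)
    return result


def classify_binary(node, pattern):
    """Answer yes/no queries until a leaf label is reached."""
    while not isinstance(node, str):
        prop, value, yes, no = node
        node = yes if pattern[prop] == value else no
    return node


# A root query on color with B = 3; the leaf labels are placeholders.
color_node = ("color", {"green": "G", "yellow": "Y", "red": "R"})
print(binarize(color_node))
```

A node with B values becomes B - 1 binary nodes, so the binary tree is deeper but implements the same classification.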


8.3.2 Test selection and node impurity 


Much of the work in designing trees focuses on deciding which property test or query 
should be performed at each node.* With non-numeric data, there is no geometrical 
interpretation of how the test at a node splits the data. However, for numerical 
data, there is a simple way to visualize the decision boundaries that are produced 
by decision trees. For example, suppose that the test at each node has the form “is
$x_i \le x_{is}$?” This leads to hyperplane decision boundaries that are perpendicular to the
coordinate axes, and to decision regions of the form illustrated in Fig. 8.3.

The fundamental principle underlying tree creation is that of simplicity: we prefer 
decisions that lead to a simple, compact tree with few nodes. This is a version of 
Occam’s razor, that the simplest model that explains data is the one to be preferred 
(Chap. ??). To this end, we seek a property test T at each node N that makes the 
data reaching the immediate descendent nodes as “pure” as possible. In formalizing 
this notion, it turns out to be more convenient to define the impurity, rather than


* The problem is further complicated by the fact that there is no reason why the test at a node 
has to involve only one property. One might well consider logical combinations of properties, such 
as using (size = medium) AND (NOT (color = yellow))? as a test. Trees in which each test is 
based on a single property are called monothetic; if the query at any of the nodes involves two or 
more properties, the tree is called polythetic. For simplicity, we generally restrict our treatment to 
monothetic trees. In all cases, the key requirement is that the decision at a node be well-defined 
and unambiguous so that the response leads down one and only one branch. 
[Diagram for Fig. 8.2: a binary decision tree with yes/no queries including color = yellow?, size = medium?, shape = round? and size = big?, and leaf nodes including Watermelon, Apple, Grape, Grapefruit, Lemon and Cherry.]


Figure 8.2: A tree with arbitrary branching factor at different nodes can always be 
represented by a functionally equivalent binary tree, i.e., one having branching factor 
B = 2 throughout. By convention the “yes” branch is on the left, the “no” branch on 
the right. This binary tree contains the same information and implements the same 
classification as that in Fig. 8.1. 


the purity of a node. Several different mathematical measures of impurity have been 
proposed, all of which have basically the same behavior. Let i(N) denote the impurity
of a node N. In all cases, we want i(N) to be 0 if all of the patterns that reach the node
bear the same category label, and to be large if the categories are equally represented. 

The most popular measure is the entropy impurity (or occasionally information 
impurity): 


$$i(N) = -\sum_j P(\omega_j) \log_2 P(\omega_j), \qquad (1)$$


where $P(\omega_j)$ is the fraction of patterns at node N that are in category $\omega_j$.* By
the well-known properties of entropy, if all the patterns are of the same category,
the impurity is 0; otherwise it is positive, with the greatest value occurring when the
different classes are equally likely.
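As a concrete check of these properties, the entropy impurity of Eq. 1 can be computed directly from the category labels that reach a node. The function name `entropy_impurity` and the list-of-labels input are assumptions made for this sketch.

```python
import math
from collections import Counter


def entropy_impurity(labels):
    """Entropy impurity (Eq. 1), in bits, of the category labels at a node."""
    n = len(labels)
    # Each category contributes -P log2 P, where P is its fraction at the node.
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


print(entropy_impurity(["a", "a", "a", "a"]))   # pure node
print(entropy_impurity(["a", "a", "b", "b"]))   # two equally likely categories
print(entropy_impurity(["a", "b", "c", "d"]))   # four equally likely categories
```

The pure node gives zero, and for c equally represented categories the value is log2(c) bits, the maximum possible for c categories.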

Another definition of impurity is particularly useful in the two-category case. 
Given the desire to have zero impurity when the node represents only patterns of 
a single category, the simplest polynomial form is: 


$$i(N) = P(\omega_1) P(\omega_2). \qquad (2)$$


This can be interpreted as a variance impurity since under reasonable assumptions it 


* Here we are a bit sloppy with notation, since we normally reserve P for probability and $\hat{P}$ for
frequency ratios. We could be even more precise by writing $\hat{P}(x \in \omega_j | N)$, i.e., the fraction
of training patterns x at node N that are in category $\omega_j$, given that they have survived all the
previous decisions that led to the node N, but for the sake of simplicity we will avoid such
notational overhead.
Figure 8.3: Monothetic decision trees create decision boundaries with portions perpendicular
to the feature axes. The decision regions are marked $R_1$ and $R_2$ in these
two-dimensional and three-dimensional two-category examples. With a sufficiently
large tree, any decision boundary can be approximated arbitrarily well.


is related to the variance of a distribution associated with the two categories (Prob- 
lem 10). A generalization of the variance impurity, applicable to two or more cate- 
gories, is the Gini impurity: 


$$i(N) = \sum_{i \ne j} P(\omega_i) P(\omega_j) = 1 - \sum_j P^2(\omega_j). \qquad (3)$$
This is just the expected error rate at node N if the category label is selected randomly 
from the class distribution present at N. This criterion is more strongly peaked at 
equal probabilities than is the entropy impurity (Fig. 8.4). 
The misclassification impurity can be written as 


$$i(N) = 1 - \max_j P(\omega_j), \qquad (4)$$

and measures the minimum probability that a training pattern would be misclassified 
at N. Of the impurity measures typically considered, this measure is the most strongly 
peaked at equal probabilities. It has a discontinuous derivative, though, and this can 
present problems when searching for an optimal decision over a continuous parameter 
space. Figure 8.4 shows these impurity functions for a two-category case, as a function 
of the probability of one of the categories. 
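The variance, Gini and misclassification impurities of Eqs. 2 through 4 can be sketched in the same style. The helper `class_fractions` and all function names are assumptions made for illustration.

```python
from collections import Counter


def class_fractions(labels):
    """Fraction of the patterns at a node that fall in each category."""
    n = len(labels)
    return [k / n for k in Counter(labels).values()]


def variance_impurity(labels):
    """Eq. 2, P(w1) P(w2); defined here only for at most two categories."""
    fractions = class_fractions(labels)
    if len(fractions) > 2:
        raise ValueError("variance impurity is a two-category measure")
    return fractions[0] * (1.0 - fractions[0])


def gini_impurity(labels):
    """Eq. 3 in the form 1 - sum_j P(w_j)^2."""
    return 1.0 - sum(p * p for p in class_fractions(labels))


def misclassification_impurity(labels):
    """Eq. 4, 1 - max_j P(w_j)."""
    return 1.0 - max(class_fractions(labels))


node = ["w1", "w1", "w1", "w2"]
print(variance_impurity(node), gini_impurity(node), misclassification_impurity(node))
```

For two categories the Gini value computed this way is exactly twice the variance impurity, so the two differ only by a scale factor, and all three measures vanish at a pure node.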

We now come to the key question — given a partial tree down to node N, what 
value s should we choose for the property test T? An obvious heuristic is to choose 
the test that decreases the impurity as much as possible. The drop in impurity is 
defined by 


$$\Delta i(N) = i(N) - P_L \, i(N_L) - (1 - P_L) \, i(N_R), \qquad (5)$$


where $N_L$ and $N_R$ are the left and right descendent nodes, $i(N_L)$ and $i(N_R)$ their
impurities, and $P_L$ is the fraction of patterns at node N that will go to $N_L$ when
property test T is used. Then the “best” test value s is the choice for T that maximizes
$\Delta i(T)$. If the entropy impurity is used, then the impurity reduction corresponds to an
Figure 8.4: For the two-category case, the impurity functions peak at equal class frequencies
and the variance and the Gini impurity functions are identical. To facilitate
comparison, the entropy, variance, Gini and misclassification impurities (given by
Eqs. 1 through 4, respectively) have been adjusted in scale and offset; such scaling
and offset do not directly affect learning or classification.


information gain provided by the query. Since each query in a binary tree is a single 
“yes/no” one, the reduction in entropy impurity due to a split at a node cannot be 
greater than one bit (Problem 5). 
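A minimal sketch of this greedy choice, assuming a single real-valued feature, tests of the form "x <= s", and the entropy impurity, is given below. The names `impurity_drop` and `best_threshold` and the tiny data set are hypothetical.

```python
import math
from collections import Counter


def entropy_impurity(labels):
    """Entropy impurity (Eq. 1), in bits, of the labels at a node."""
    n = len(labels)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


def impurity_drop(labels, goes_left, impurity=entropy_impurity):
    """Eq. 5: i(N) - P_L i(N_L) - (1 - P_L) i(N_R) for one candidate test."""
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    if not left or not right:        # sending everything one way splits nothing
        return 0.0
    p_left = len(left) / len(labels)
    return impurity(labels) - p_left * impurity(left) - (1 - p_left) * impurity(right)


def best_threshold(values, labels, impurity=entropy_impurity):
    """Try every test 'x <= s' on one feature and keep the largest drop."""
    best_drop, best_s = 0.0, None
    for s in sorted(set(values))[:-1]:   # the largest value would leave N_R empty
        drop = impurity_drop(labels, [v <= s for v in values], impurity)
        if drop > best_drop:
            best_drop, best_s = drop, s
    return best_drop, best_s


drop, s = best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
print(drop, s)
```

Here the test x <= 2.0 separates the two categories perfectly, and the drop equals the full one bit of entropy at the parent, the largest reduction a single yes/no query can give.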

The way to find an optimal decision for a node depends upon the general form of 
decision. Since the decision criteria are based on the extrema of the impurity func- 
tions, we are free to change such a function by an additive constant or overall scale 
factor and this will not affect which split is found. Designers typically choose functions 
that are easy to compute, such as those based on a single feature or attribute, giving 
a monothetic tree. If the form of the decisions is based on the nominal attributes, 
we may have to perform extensive or exhaustive search over all possible subsets of 
the training set to find the rule maximizing $\Delta i$. If the attributes are real-valued,
one could use gradient descent algorithms to find a splitting hyperplane (Sect. 8.3.8), 
giving a polythetic tree. An important reason for favoring binary trees is that the 
decision at any node can generally be cast as a one-dimensional optimization problem. 
If the branching factor B were instead greater than 2, a two- or higher-dimensional 
optimization would be required; this is generally much more difficult (Computer ex- 
ercise ??). 

Sometimes there will be several decisions s that lead to the same reduction in
impurity and the question arises how to choose among them. For example, if the
features are real-valued and a split lying anywhere in a range $x_l < x_s < x_u$ for
the x variable leads to the same (maximum) impurity reduction, it is traditional to
choose either the midpoint, $x_s = (x_l + x_u)/2$, or the weighted average,
$x_s = (1 - P) x_l + x_u P$, where P is the probability a pattern goes to the
“left” under the decision. Computational simplicity may be the determining factor as
there are rarely deep theoretical reasons to favor one over another.

Note too that the optimization of Eq. 5 is local — done at a single node. As with 
the vast majority of such greedy methods, there is no guarantee that successive locally 
optimal decisions lead to the global optimum. In particular, there is no guarantee 
that after training we have the smallest tree (Computer exercise ??). Nevertheless, 
for every reasonable impurity measure and learning method, we can always continue
to split further to get the lowest possible impurity at the leaves (Problem ??). There
is no assurance that the impurity at a leaf node will be zero, however: if two
patterns have the same attribute description yet come from different categories, the
impurity will be greater than zero.

Occasionally during tree creation the misclassification impurity (Eq. 4) will not
decrease whereas the Gini impurity would (Problem ??); thus although classification
is our final goal, we may prefer the Gini impurity because it “anticipates” later splits
that will be useful. Consider a case where at node N there are 90 patterns in $\omega_1$ and
10 in $\omega_2$. Thus the misclassification impurity is 0.1. Suppose there are no splits that
guarantee an $\omega_2$ majority in either of the two descendent nodes. Then the misclassification
impurity remains at 0.1 for all splits. Now consider a split which sends 70 $\omega_1$ patterns
to the right along with 0 $\omega_2$ patterns, and sends 20 $\omega_1$ and 10 $\omega_2$ to the left. This is
an attractive split but the misclassification impurity is still 0.1. On the other hand,
the Gini impurity for this split is less than the Gini for the parent node. In short,
the Gini impurity shows that this is a good split while the misclassification rate does
not.
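The numbers in this example are easy to verify. The sketch below works from per-category counts rather than label lists; the function names are assumptions, and the Gini impurity is taken in the form 1 - sum_j P(w_j)^2 of Eq. 3.

```python
def gini(counts):
    """Gini impurity from per-category counts at a node."""
    n = sum(counts)
    return 1.0 - sum((k / n) ** 2 for k in counts)


def misclassification(counts):
    """Misclassification impurity (Eq. 4) from per-category counts."""
    return 1.0 - max(counts) / sum(counts)


def drop(impurity, parent, left, right):
    """Impurity reduction of Eq. 5 for a split of parent into left and right."""
    p_left = sum(left) / sum(parent)
    return impurity(parent) - p_left * impurity(left) - (1 - p_left) * impurity(right)


parent, left, right = (90, 10), (20, 10), (70, 0)   # counts of (w1, w2)
print("misclassification drop:", drop(misclassification, parent, left, right))
print("Gini drop:", drop(gini, parent, left, right))
```

The misclassification drop is zero up to rounding (0.1 at the parent against 0.3 x 1/3 after the split), whereas the Gini impurity falls from 0.18 to 0.3 x 4/9, about 0.133, a drop of roughly 0.047.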

In multiclass binary tree creation, the twoing criterion may be useful.* The overall
goal is to find the split that best splits groups of the c categories, i.e., a candidate
“supercategory” $C_1$ consisting of all patterns in some subset of the categories, and
candidate “supercategory” $C_2$ as all remaining patterns. Let the class of categories
be $C = \{\omega_1, \omega_2, \ldots, \omega_c\}$. At each node, the decision splits the categories into
$C_1 = \{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_k}\}$ and $C_2 = C - C_1$. For every candidate split s, we compute a change
in impurity $\Delta i(s, C_1)$ as though it corresponded to a standard two-class problem. That
is, we find the split $s^*(C_1)$ that maximizes the change in impurity. Finally, we find
the supercategory $C_1^*$ which maximizes $\Delta i(s^*(C_1), C_1)$. The benefit of this impurity is
that it is strategic: it may learn the largest scale structure of the overall problem
(Problem 4).
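The search over supercategories can be sketched by brute force. This follows the verbal description above rather than any particular published formula for twoing; the use of the two-class Gini impurity, threshold tests on a single feature, and the name `twoing_split` are all assumptions made for illustration.

```python
from itertools import combinations


def gini_two_class(flags):
    """Two-class Gini impurity of a list of booleans (True = in C1)."""
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)


def twoing_split(values, labels):
    """Return (drop, threshold s, supercategory C1) with the largest two-class drop."""
    categories = sorted(set(labels))
    best = (0.0, None, None)
    for r in range(1, len(categories)):              # proper, non-empty subsets only
        for c1 in combinations(categories, r):
            flags = [y in c1 for y in labels]        # relabel as C1 versus C2
            for s in sorted(set(values))[:-1]:       # tests of the form x <= s
                left = [f for v, f in zip(values, flags) if v <= s]
                right = [f for v, f in zip(values, flags) if v > s]
                p_left = len(left) / len(flags)
                drop = (gini_two_class(flags) - p_left * gini_two_class(left)
                        - (1 - p_left) * gini_two_class(right))
                if drop > best[0]:
                    best = (drop, s, set(c1))
    return best


values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["a", "a", "b", "b", "c", "c", "d", "d"]
print(twoing_split(values, labels))
```

On this toy data the winning supercategory groups the two categories with small feature values against the other two, which is the large-scale structure of the problem. The enumeration visits on the order of 2^c subsets, so it is practical only for a modest number of categories.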

It may be surprising, but the particular choice of an impurity function rarely seems 
to affect the final classifier and its accuracy. An entropy impurity is frequently used 
because of its computational simplicity and basis in information theory, though the 
Gini impurity has received significant attention as well. In practice, the stopping 
criterion and the pruning method — when to stop splitting nodes, and how to merge 
leaf nodes — are more important than the impurity function itself in determining 
final classifier accuracy, as we shall see. 


Multi-way splits 


Although we shall concentrate on binary trees, we briefly mention the matter of 
allowing the branching ratio at each node to be set during training, a technique we will
return to in a discussion of the ID3 algorithm (Sect. 8.4.1). In such a case, it is 
tempting to use a multi-branch generalization of Eq. 5 of the form 


$$\Delta i(s) = i(N) - \sum_{k=1}^{B} P_k \, i(N_k), \qquad (6)$$


where $P_k$ is the fraction of training patterns sent down the link to node $N_k$, and
$\sum_{k=1}^{B} P_k = 1$. However, the drawback with Eq. 6 is that decisions with large B are


inherently favored over those with small B whether or not the large B splits in fact 
represent meaningful structure in the data. For instance, even in random data, a 


* The twoing criterion is not a true impurity measure. 
high-B split will reduce the impurity more than will a low-B split. To avoid this
drawback, the candidate change in impurity of Eq. 6 must be scaled, according to


$$\Delta i_B(s) = \frac{\Delta i(s)}{-\sum_{k=1}^{B} P_k \log_2 P_k}, \qquad (7)$$


a method based on the gain ratio impurity (Problem 17). Just as before, the optimal
split is the one maximizing $\Delta i_B(s)$.
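The effect of the scaling in Eq. 7 can be seen on a toy example. The sketch assumes the entropy impurity for i(N) and describes a candidate split by the branch index assigned to each training pattern; the name `gain_ratio` and the data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Entropy in bits of the empirical distribution of the items."""
    n = len(items)
    return sum(-(k / n) * math.log2(k / n) for k in Counter(items).values())


def gain_ratio(labels, branch_ids):
    """Eq. 7: the drop of Eq. 6 divided by -sum_k P_k log2 P_k."""
    n = len(labels)
    branches = {}
    for y, b in zip(labels, branch_ids):
        branches.setdefault(b, []).append(y)
    drop = entropy(labels) - sum(len(g) / n * entropy(g) for g in branches.values())
    split_entropy = entropy(branch_ids)      # the denominator of Eq. 7
    return drop / split_entropy if split_entropy > 0 else 0.0


labels = ["a", "a", "b", "b"]
print(gain_ratio(labels, [0, 0, 1, 1]))   # two-way split
print(gain_ratio(labels, [0, 1, 2, 3]))   # four-way split, one pattern per branch
```

Both splits remove the full one bit of impurity, so Eq. 6 cannot distinguish them; after scaling, the two-way split scores 1.0 and the four-way split only 0.5.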


8.3.3 When to stop splitting 


Consider now the problem of deciding when to stop splitting during the training of 
a binary tree. If we continue to grow the tree fully until each leaf node corresponds 
to the lowest impurity, then the data has typically been overfit (Chap. ??). In the 
extreme but rare case, each leaf corresponds to a single training point and the full tree 
is merely a convenient implementation of a lookup table; it thus cannot be expected 
to generalize well in (noisy) problems having high Bayes error. Conversely, if splitting 
is stopped too early, then the error on the training data is not sufficiently low and 
hence performance may suffer. 

How shall we decide when to stop splitting? One traditional approach is to use 
techniques of Chap. ??, in particular cross-validation. That is, the tree is trained 
using a subset of the data (for instance 90%), with the remaining (10%) kept as a 
validation set. We continue splitting nodes in successive layers until the error on the 
validation data is minimized. 

Another method is to set a (small) threshold value in the reduction in impurity;
splitting is stopped if the best candidate split at a node reduces the impurity by
less than that pre-set amount, i.e., if $\max_s \Delta i(s) < \beta$. This method has two main
benefits. First, unlike cross-validation, the tree is trained directly using all the training
data. Second, leaf nodes can lie in different levels of the tree, which is desirable
whenever the complexity of the data varies throughout the range of input. (Such an
unbalanced tree requires a different number of decisions for different test patterns.)
A fundamental drawback of the method, however, is that it is often difficult to know
how to set the threshold because there is rarely a simple relationship between $\beta$ and
the ultimate performance (Computer exercise 2). A very simple method is to stop
when a node represents fewer than some threshold number of points, say 10, or some 
fixed percentage of the total training set, say 5%. This has a benefit analogous to 
that in k-nearest-neighbor classifiers (Chap. ??); that is, the size of the partitions is 
small in regions where data is dense, but large where the data is sparse. 
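
The two threshold rules just described can be sketched as a single test applied at each node. This is only an illustration: the text presents the rules as alternatives, whereas the hypothetical function `should_stop` below combines them, and the default values for `beta`, `min_points` and `min_fraction` are assumptions rather than recommendations.

```python
def should_stop(best_delta_i, n_node, n_total,
                beta=0.01, min_points=10, min_fraction=0.05):
    """Return True if the node should be declared a leaf.

    best_delta_i: largest impurity reduction over all candidate splits.
    n_node:       number of training points reaching this node.
    n_total:      size of the full training set.
    beta, min_points, min_fraction: illustrative thresholds (assumptions).
    """
    if best_delta_i < beta:                # impurity-reduction threshold
        return True
    if n_node < min_points:                # absolute node-size threshold
        return True
    if n_node < min_fraction * n_total:    # relative node-size threshold
        return True
    return False
```

In practice a designer would typically enable only one of the three tests; they are grouped here to keep the sketch short.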

Yet another method is to trade complexity for test accuracy by splitting until a
minimum in a new, global criterion function,


$$\alpha \cdot \mathit{size} + \sum_{\text{leaf nodes}} i(N), \qquad (8)$$


is reached. Here size could represent the number of nodes or links and $\alpha$ is some
positive constant. (This is analogous to regularization methods in neural networks
that penalize connection weights or nodes.) If an impurity based on entropy is used
for $i(N)$, then Eq. 8 finds support from minimum description length (MDL), which
we shall consider again in Chap. ??. The sum of the impurities at the leaf nodes is a
measure of the uncertainty (in bits) in the training data given the model represented
by the tree; the size of the tree is a measure of the complexity of the classifier itself
(which also could be measured in bits). A difficulty, however, is setting $\alpha$, as it is not
always easy to find a simple relationship between $\alpha$ and the final classifier performance
(Computer exercise 3).
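
To make the trade-off in Eq. 8 concrete, the following sketch evaluates the criterion for two hypothetical trees grown on the same eight training points: a three-node stump with impure leaves and a seven-node tree with pure leaves. It assumes that size is the number of nodes and that the impurity is the entropy; the leaf counts are invented for illustration.

```python
import math

def entropy(counts):
    """Entropy impurity (in bits) from per-category counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def criterion(n_nodes, leaf_counts, alpha):
    """Eq. 8: alpha * size plus the sum of the leaf impurities."""
    return alpha * n_nodes + sum(entropy(c) for c in leaf_counts)

# Hypothetical trees on the same data (counts are per-category, per leaf).
stump = (3, [[3, 1], [1, 3]])
deep = (7, [[3, 0], [0, 1], [1, 0], [0, 3]])

for alpha in (0.1, 0.5):
    print(alpha, criterion(*stump, alpha), criterion(*deep, alpha))
```

With the small penalty the deeper tree has the lower criterion value, while with the larger penalty the stump does, which illustrates why the choice of the constant matters.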

An alternative approach is to use a stopping criterion based on the statistical 
significance of the reduction of impurity. During tree construction, we estimate the 
distribution of all the $\Delta i$ for the current collection of nodes; we assume this is the
full distribution of $\Delta i$. For any candidate node split, we then determine whether it
is statistically different from zero, for instance by a chi-squared test (cf. Sect. ??). 
If a candidate split does not reduce the impurity significantly, splitting is stopped 
(Problem 15). 

A variation in this technique of hypothesis testing can be applied even without
strong assumptions on the distribution of $\Delta i$. We seek to determine whether a
candidate split is “meaningful,” that is, whether it differs significantly from a random
split. Suppose n patterns survive at node N (with $n_1$ in $\omega_1$ and $n_2$ in $\omega_2$); we wish to
decide whether a candidate split s differs significantly from a random one. Suppose
a particular candidate split s sends Pn patterns to the left branch, and (1 - P)n to
the right branch. A random split having this probability (i.e., the null hypothesis)
would place $Pn_1$ of the $\omega_1$ patterns and $Pn_2$ of the $\omega_2$ patterns to the left, and the
remaining to the right. We quantify the deviation of the results due to candidate split
s from the (weighted) random split by means of the chi-squared statistic, which in
this two-category case is


$$\chi^2 = \sum_{i=1}^{2} \frac{(n_{iL} - n_{ie})^2}{n_{ie}}, \qquad (9)$$


where $n_{iL}$ is the number of patterns in category $\omega_i$ sent to the left under decision s,
and $n_{ie} = Pn_i$ is the number expected by the random rule. The chi-squared statistic
vanishes if the candidate split s gives the same distribution as the random one, and
is larger the more s differs from the random one. When $\chi^2$ is greater than a critical
value, as given in a table (cf. Table ??), then we can reject the null hypothesis since
s differs “significantly” at some probability or confidence level, such as .01 or .05.
The critical values of the confidence depend upon the number of degrees of freedom,
which in the case just described is 1, since for a given probability P the single value
$n_{1L}$ specifies all other values ($n_{1R}$, $n_{2L}$ and $n_{2R}$). If the “most significant” split at a
node does not yield a $\chi^2$ exceeding the chosen confidence level threshold, splitting is
stopped.
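
The test can be sketched as follows. The code follows Eq. 9 literally, summing over the left-branch counts only, and compares the result with the standard chi-squared critical values for one degree of freedom (3.841 at the .05 level and 6.635 at the .01 level). The function names are illustrative, and the sketch assumes that both categories are present at the node and that the split sends at least one pattern to the left.

```python
def chi_squared_split(n_left, n_total):
    """Chi-squared statistic of Eq. 9 for a two-category binary split.

    n_left:  (n_1L, n_2L), patterns of each category sent left by the split.
    n_total: (n_1, n_2), patterns of each category present at the node.
    """
    p_left = sum(n_left) / sum(n_total)   # P, the fraction sent to the left
    chi2 = 0.0
    for n_il, n_i in zip(n_left, n_total):
        n_ie = p_left * n_i               # count expected under a random split
        chi2 += (n_il - n_ie) ** 2 / n_ie
    return chi2

# Standard critical values of chi-squared for one degree of freedom.
CRITICAL_1DOF = {0.05: 3.841, 0.01: 6.635}

def is_significant(n_left, n_total, level=0.05):
    """True if the split differs significantly from a random one."""
    return chi_squared_split(n_left, n_total) > CRITICAL_1DOF[level]
```

For a node with 20 patterns in each category, a split sending 18 and 4 patterns to the left gives a statistic of 98/11, which exceeds both critical values, while a split sending 10 and 10 to the left gives exactly zero and splitting would stop.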


8.3.4 Pruning 


Occasionally, stopped splitting suffers from the lack of sufficient look-ahead, a
phenomenon called the horizon effect. The determination of the optimal split at a node
N is not influenced by decisions at N’s descendent nodes, i.e., those at subsequent 
levels. In stopped splitting, node N might be declared a leaf, cutting off the possi- 
bility of beneficial splits in subsequent nodes; as such, a stopping condition may be 
met “too early” for overall optimal recognition accuracy. Informally speaking, the 
stopped splitting biases the learning algorithm toward trees in which the greatest 
impurity reduction is near the root node. 

The principal alternative approach to stopped splitting is pruning. In pruning, a 
tree is grown fully, that is, until leaf nodes have minimum impurity, beyond any
putative “horizon.” Then, all pairs of neighboring leaf nodes (i.e., ones linked to a
common antecedent node, one level above) are considered for elimination. Any pair 
whose elimination yields a satisfactory (small) increase in impurity is eliminated, and 
the common antecedent node declared a leaf. (This antecedent, in turn, could itself 
be pruned.) Clearly, such merging or joining of the two leaf nodes is the inverse of 
splitting. It is not unusual that after such pruning, the leaf nodes lie in a wide range 
of levels and the tree is unbalanced. 
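
The merging of leaf pairs described above can be sketched as a bottom-up recursion. The `Node` class, the entropy impurity, and the tolerance `max_increase` are assumptions made for illustration; the increase in impurity caused by a merge is taken to be the impurity drop that the split had originally bought.

```python
import math

def entropy(counts):
    """Entropy impurity (in bits) from per-category counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

class Node:
    def __init__(self, counts, left=None, right=None):
        self.counts = counts              # per-category training counts here
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node, max_increase):
    """Merge sibling leaf pairs, deepest first, while the impurity increase is small."""
    if node.is_leaf():
        return node
    prune(node.left, max_increase)
    prune(node.right, max_increase)
    if node.left.is_leaf() and node.right.is_leaf():
        n = sum(node.counts)
        p_left = sum(node.left.counts) / n
        increase = (entropy(node.counts)
                    - p_left * entropy(node.left.counts)
                    - (1 - p_left) * entropy(node.right.counts))
        if increase <= max_increase:
            node.left = node.right = None  # the antecedent node becomes a leaf
    return node
```

Because children are visited before their parent, an antecedent node that has just become a leaf is immediately considered for merging with its own sibling, so pruning can cascade toward the root.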

Although it is most common to prune starting at the leaf nodes, this is not nec- 
essary: cost-complexity pruning can replace a complex subtree with a leaf directly. 
Further, C4.5 (Sect. 8.4.2) can eliminate an arbitrary test node, thereby replacing a 
subtree by one of its branches. 

The benefits of pruning are that it avoids the horizon effect; further, since there 
is no training data held out for cross-validation, it directly uses all information in the 
training set. Naturally, this comes at a greater computational expense than stopped 
splitting, and for problems with large training sets, the expense can be prohibitive 
(Computer exercise ??). For small problems, though, these computational costs are 
low and pruning is generally to be preferred over stopped splitting. Incidentally, what 
we have been calling stopped splitting and pruning are sometimes called pre-pruning
and post-pruning, respectively. 

A conceptually different pruning method is based on rules. Each leaf has an 
associated rule — the conjunction of the individual decisions from the root node, 
through the tree, to the particular leaf. Thus the full tree can be described by a large 
list of rules, one for each leaf. Occasionally, some of these rules can be simplified 
if a series of decisions is redundant. Eliminating the irrelevant precondition rules 
simplifies the description, but has no influence on the classifier function, including 
its generalization ability. The predominant reason to prune, however, is to improve 
generalization. In this case we therefore eliminate rules so as to improve accuracy on a 
validation set (Computer exercise 6). This technique may even allow the elimination 
of a rule corresponding to a node near the root. 

One of the benefits of rule pruning is that it allows us to distinguish between the
contexts in which any particular node N is used. For instance, for some test pattern
$\mathbf{x}_1$, the decision rule at node N is necessary; for another test pattern $\mathbf{x}_2$ that rule is
irrelevant and thus N could be pruned. In traditional node pruning, we must either
keep N or prune it away. In rule pruning, however, we can eliminate it where it is
not necessary (i.e., for patterns such as $\mathbf{x}_2$) and retain it for others (such as $\mathbf{x}_1$).

A final benefit is that the reduced rule set may give improved interpretability. 
Although rule pruning was not part of the original CART approach, such pruning 
can be easily applied to CART trees. We shall consider an example of rule pruning 
in Sect. 8.4.2. 


8.3.5 Assignment of leaf node labels 


Assigning category labels to the leaf nodes is the simplest step in tree construction. If 
successive nodes are split as far as possible, and each leaf node corresponds to patterns 
in a single category (zero impurity), then of course this category label is assigned to 
the leaf. In the more typical case, where either stopped splitting or pruning is used 
and the leaf nodes have positive impurity, each leaf should be labeled by the category 
that has most points represented. An extremely small impurity is not necessarily 
desirable, since it may be an indication that the tree is overfitting the training data. 


Example 1 illustrates some of these steps.

Example 1: A simple tree classifier


Consider the following n = 16 points in two dimensions for training a binary 
CART tree (B = 2) using the entropy impurity (Eq. 1).

Training data and associated (unpruned) tree are shown at the top. The entropy
impurity at non-terminal nodes is shown in red and the impurity at each leaf is 0. If
the single training point marked * were instead slightly lower (marked †), the resulting
tree and decision regions would differ significantly, as shown at the bottom.

The impurity of the root node is


$$i(N_{root}) = -\sum_{i=1}^{2} P(\omega_i) \log_2 P(\omega_i) = -[.5 \log_2 .5 + .5 \log_2 .5] = 1.0.$$


For simplicity we consider candidate splits parallel to the feature axes, i.e., of the form
“is $x_i < x_{is}$?”. By exhaustive search of the n - 1 positions for the $x_1$ feature and n - 1
positions for the $x_2$ feature we find by Eq. 5 that the greatest reduction in the impurity
occurs near $x_{1s} = 0.6$, and hence this becomes the decision criterion at the root node.
We continue for each sub-tree until each final node represents a single category (and
thus has the lowest impurity, 0), as shown in the figure. If pruning were invoked,
the pair of leaf nodes at the left would be the first to be deleted (gray shading) since
there the impurity is increased the least. In this example, stopped splitting with the
proper threshold would also give the same final tree. In general, however, with
large trees and many pruning steps, pruning and stopped splitting need not lead to 
the same final tree. 
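
The exhaustive search for an axis-parallel split can be sketched as below. The eight two-dimensional points are hypothetical and are not the training set of this example; candidate thresholds are taken midway between adjacent distinct feature values, which gives at most n - 1 positions per feature.

```python
import math

def entropy(labels):
    """Entropy impurity (in bits) of a list of category labels."""
    n = len(labels)
    return -sum(
        labels.count(c) / n * math.log2(labels.count(c) / n) for c in set(labels)
    )

def best_axis_parallel_split(points, labels):
    """Return (impurity_drop, feature_index, threshold) of the best rule
    "x[feature_index] < threshold" found by exhaustive search."""
    n, d = len(points), len(points[0])
    parent = entropy(labels)
    best = (0.0, None, None)
    for j in range(d):
        values = sorted(set(p[j] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for p, y in zip(points, labels) if p[j] < t]
            right = [y for p, y in zip(points, labels) if p[j] >= t]
            drop = (parent
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if drop > best[0]:
                best = (drop, j, t)
    return best

# Hypothetical data: the first feature separates the two categories.
points = [(0.1, 0.9), (0.2, 0.3), (0.3, 0.7), (0.4, 0.2),
          (0.7, 0.8), (0.8, 0.4), (0.9, 0.6), (0.6, 0.1)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
print(best_axis_parallel_split(points, labels))
```

Applying the same function recursively to the points on each side of the chosen split, until every node is pure, grows the full unpruned tree.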

This particular training set shows how trees can be sensitive to details of the training
points. If the $\omega_2$ point marked * in the top figure is moved slightly (marked †), the
tree and decision regions differ significantly, as shown at the bottom. Such instability 
is due in large part to the discrete nature of decisions early in the tree learning. 


Example 1 illustrates the informal notion of instability or sensitivity to training 
points. Of course, if we train any common classifier with a slightly different training 
set the final classification decisions will differ somewhat. If we train a CART classifier, 
however, the alteration of even a single training point can lead to radically different 
decisions overall. This is a consequence of the discrete and inherently greedy nature 
of such tree creation. Instability often indicates that incremental and off-line versions 
of the method will yield significantly different classifiers, even when trained on the 
same data. 


8.3.6 Computational complexity 


Suppose we have n training patterns in d dimensions in a two-category problem, and 
wish to construct a binary tree based on splits parallel to the feature axes using an 
entropy impurity. What are the time and the space complexities? 

At the root node (level 0) we must first sort the training data, O(n log n) for each of
the d features or dimensions. The entropy calculation is O(n) + (n - 1)O(d) since we
examine n - 1 possible splitting points. Thus for the root node the time complexity
is O(dn log n). Consider an average case, where roughly half the training points are
sent to each of the two branches. The above analysis implies that splitting each
node in level 1 has complexity O(d (n/2) log(n/2)); since there are two such nodes
at that level, the total complexity is O(dn log(n/2)). Similarly, for level 2 we
have O(dn log(n/4)), and so on. The total number of levels is O(log n). We sum the
terms for the levels and find that the total average time complexity is $O(dn (\log n)^2)$.
The time complexity for recall is just the depth of the tree, i.e., the total number
of levels, which is O(log n). The space complexity is simply the number of nodes, which,
given some simplifying assumptions (such as a single training point per leaf node), is
1 + 2 + 4 + ... + n/2 ≈ n, that is, O(n) (Problem 9).
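
The summation over levels can be written out explicitly. Under the equal-split assumption, level l contributes a cost proportional to dn log(n/2^l), and there are about log_2 n levels, so that

```latex
\begin{align*}
T(n) &= \sum_{l=0}^{\log_2 n} O\!\left(d\,n \log_2 \frac{n}{2^l}\right)
      = O\!\left(d\,n \sum_{l=0}^{\log_2 n} \left(\log_2 n - l\right)\right) \\
     &= O\!\left(d\,n \cdot \tfrac{1}{2} \log_2 n \left(\log_2 n + 1\right)\right)
      = O\!\left(d\,n\,(\log n)^2\right).
\end{align*}
```

The arithmetic series in the middle step is what turns the single logarithm of the root-node cost into a squared logarithm for the whole tree.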




We stress that these assumptions (for instance equal splits at each node) rarely 
hold exactly; moreover, heuristics can be used to speed the search for splits during
training. Nevertheless, the result that for fixed dimension d the training is
$O(dn (\log n)^2)$ and classification O(log n) is a good rule of thumb; it illustrates how
training is far more computationally expensive than is classification, and that on 
average this discrepancy grows as the problem gets larger. 

There are several techniques for reducing the complexity during the training of 
trees based on real-valued data. One of the simplest heuristics is to begin the search 
for splits $x_{is}$ at the “middle” of the range of the training set, moving alternately
to progressively higher and lower values. Optimal splits always occur for decision 
thresholds between adjacent points from different categories and thus one should test 
only such ranges. These and related techniques generally provide only moderate 
reductions in computation (Computer exercise ??). When the patterns consist of 
nominal data, candidate splits could be over every subset of attributes, or just a 
single entry, and the computational burden is best lowered using insight into features 
(Problem 3). 
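
The observation that only thresholds between adjacent points from different categories need be tested can be sketched for a single real-valued feature. The function name is illustrative, and the sketch makes the simplifying assumption that tied feature values are skipped rather than handled specially.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose categories differ."""
    pairs = sorted(zip(values, labels))
    return [
        (v0 + v1) / 2
        for (v0, y0), (v1, y1) in zip(pairs, pairs[1:])
        if y0 != y1 and v0 != v1
    ]

# Six points along one feature; only two of the five gaps need be tested.
print(candidate_thresholds([1, 2, 3, 4, 5, 6], ["a", "a", "b", "b", "a", "a"]))
```

The saving depends on how interleaved the categories are along the feature, which is consistent with the remark that such heuristics give only moderate reductions in computation.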


8.3.7 Feature choice 


As with most pattern recognition techniques, CART and other tree-based methods 
work best if the “proper” features are used (Fig. 8.5). For real-valued vector data, 
most standard preprocessing techniques can be used before creating a tree. Pre- 
processing by principal components (Chap. ??) can be effective, since it finds the 
“important” axes, and this generally leads to simple decisions at the nodes. If how- 
ever the principal axes in one region differ significantly from those in another region, 
then no single choice of axes overall will suffice. In that case we may need to employ 
the techniques of Sect. 8.3.8, for instance allowing splits to be at arbitrary orientation, 
often giving smaller and more compact trees. 


8.3.8 Multivariate decision trees 


If the “natural” splits of real-valued data do not fall parallel to the feature axes or the 
full training data set differs significantly from simple or accommodating distributions, 
then the above methods may be rather inefficient and lead to poor generalization 
(Fig. 8.6); even pruning may be insufficient to give a good classifier. The simplest 
solution is to allow splits that are not parallel to the feature axes, such as a general 
linear classifier trained via gradient descent on a classification or sum-squared-error 
criterion (Chap. ??). While such training may be slow for the nodes near the root if 
the training set is large, training will be faster at nodes closer to the leaves since less
training data is used. Recall can remain quite fast since the linear functions at each 
node can be computed rapidly. 


8.3.9 Priors and costs 


Up to now we have tacitly assumed that a category $\omega_i$ is represented with the same
frequency in both the training and the test data. If this is not the case, we need
a method for controlling tree creation so as to have lower error on the actual final
classification task when the frequencies are different. The most direct method is to
“weight” samples to correct for the prior frequencies (Problem 16). Furthermore,
we may seek to minimize a general cost, rather than a strict misclassification or 0-1
cost. As in Chap. ??, we represent such information in a cost matrix $\lambda_{ij}$, the
cost of classifying a pattern as $\omega_i$ when it is actually $\omega_j$. Cost information is easily
incorporated into a Gini impurity, giving the following weighted Gini impurity,


$$i(N) = \sum_{ij} \lambda_{ij} P(\omega_i) P(\omega_j), \qquad (10)$$


which should be used during training. Costs can be incorporated into other impurity
measures as well (Problem 11).

Figure 8.5: If the class of node decisions does not match the form of the training data,
a very complicated decision tree will result, as shown at the top. Here decisions are
parallel to the axes while in fact the data is better split by boundaries along another
direction. If however “proper” decision forms are used (here, linear combinations of
the features), the tree can be quite simple, as shown at the bottom.

Figure 8.6: One form of multivariate tree employs general linear decisions at each
node, giving splits along arbitrary directions in the feature space. In virtually all
interesting cases the training data is not linearly separable, and thus the LMS algorithm
is more useful than methods that require the data to be linearly separable, even
though the LMS need not yield a minimum in classification error (Chap. ??). The
tree at the bottom can be simplified by methods outlined in Sect. 8.4.2.
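
The weighted Gini impurity of Eq. 10 can be sketched directly from its definition. The function name and the example cost matrices are illustrative assumptions; `cost[i][j]` plays the role of the cost of deciding category i when the true category is j.

```python
def weighted_gini(probs, cost):
    """Weighted Gini impurity of Eq. 10.

    probs: category probabilities P(w_i) at the node.
    cost:  cost[i][j] is the cost of deciding category i when the truth is j.
    """
    c = len(probs)
    return sum(cost[i][j] * probs[i] * probs[j]
               for i in range(c) for j in range(c))

zero_one = [[0, 1], [1, 0]]       # strict misclassification cost
asymmetric = [[0, 5], [1, 0]]     # hypothetical unequal error costs

print(weighted_gini([0.5, 0.5], zero_one))
print(weighted_gini([0.5, 0.5], asymmetric))
```

With 0-1 costs the double sum reduces to the sum over all pairs of distinct categories of the product of their probabilities, that is, one minus the sum of the squared probabilities.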


8.3.10 Missing attributes 


Classification problems might have missing attributes during training, during classi- 
fication, or both. Consider first training a tree classifier despite the fact that some 
training patterns are missing attributes. A naive approach would be to delete from 
consideration any such deficient patterns; however, this is quite wasteful and should be 
employed only if there are many complete patterns. A better technique is to proceed 
as otherwise described above (Sect. 8.3.2), but instead calculate impurities at a node
N using only the attribute information present. Suppose there are n training points 
at N and that each has three attributes, except one pattern that is missing attribute 
x3. To find the best split at N, we calculate possible splits using all n points using 
attribute x1, then all n points for attribute x2, then the n— 1 non-deficient points for 
attribute x3. Each such split has an associated reduction in impurity, calculated as
before, though here with different numbers of patterns. As always, the desired split 
is the one which gives the greatest decrease in impurity. The generalization of this 
procedure to more features, to multiple patterns with missing attributes, and even to
patterns with several missing attributes is straightforward, as is its use in classifying
non-deficient patterns (Problem 14). 

Now consider how to create and use trees that can classify a deficient pattern. The 
trees described above cannot directly handle test patterns lacking attributes (but see 
Sect. 8.4.2), and thus if we suspect that such deficient test patterns will occur, we 
must modify the training procedure discussed in Sect. 8.3.2. The basic approach 
during classification is to use the traditional (“primary”) decision at a node whenever 
possible (i.e., when the query involves a feature that is present in the deficient test
pattern) but to use alternate queries whenever the test pattern is missing that feature. 

During training then, in addition to the primary split, each non-terminal node 
N is given an ordered set of surrogate splits, consisting of an attribute label and a 
rule. The first such surrogate split maximizes the “predictive association” with the 
primary split. A simple measure of the predictive association of two splits $s_1$ and
$s_2$ is merely the numerical count of patterns that are sent to the “left” by both $s_1$
and $s_2$ plus the count of the patterns sent to the “right” by both the splits. The
second surrogate split is defined similarly, being the one which uses another feature 
and best approximates the primary split in this way. Of course, during classification 
of a deficient test pattern, we use the first surrogate split that does not involve the test 
pattern’s missing attributes. This missing value strategy corresponds to a linear model 
replacing the pattern’s missing value by the value of the non-missing attribute most 
strongly correlated with it (Problem ??). This strategy uses to maximum advantage 
the (local) associations among the attributes to decide the split when attribute values 
are missing. A method closely related to surrogate splits is that of virtual values, in 
which the missing attribute is assigned its most likely value. 


Example 2: Surrogate splits and missing attributes


Consider the creation of a monothetic tree using an entropy impurity and the 
following ten training points. Since the tree will be used to classify test patterns with 
missing features, we will give each node surrogate splits. 


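$\omega_1$: $\mathbf{x}_1 = (0, 7, 8)^t$, $\mathbf{x}_2 = (1, 8, 9)^t$, $\mathbf{x}_3 = (2, 9, 0)^t$, $\mathbf{x}_4 = (4, 1, 1)^t$, $\mathbf{x}_5 = (5, 2, 2)^t$

$\omega_2$: $\mathbf{y}_1 = (3, 3, 3)^t$, $\mathbf{y}_2 = (6, 0, 4)^t$, $\mathbf{y}_3 = (7, 4, 5)^t$, $\mathbf{y}_4 = (8, 5, 6)^t$, $\mathbf{y}_5 = (9, 6, 7)^t$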


Through exhaustive search along all three features, we find the primary split at the
root node should be “$x_1 < 5.5$?”, which sends $\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4, \mathbf{x}_5, \mathbf{y}_1\}$ to the left and
$\{\mathbf{y}_2, \mathbf{y}_3, \mathbf{y}_4, \mathbf{y}_5\}$ to the right, as shown in the figure.

We now seek the first surrogate split at the root node; such a split must be based
on either the $x_2$ or the $x_3$ feature. Through exhaustive search we find that the split
“$x_3 < 3.5$?” has the highest predictive association with the primary split, a value
of 8, since 8 patterns are sent to matching directions by each rule, as shown in the
figure. The second surrogate split must be along the only remaining feature, $x_2$. We
find that for this feature the rule “$x_2 < 3.5$?” has the highest predictive association
with the primary split, a value of 6. (This, incidentally, is not the optimal $x_2$ split for
impurity reduction; we use it because it best approximates the preferred, primary
split.) While the above describes the training of the root node, training of other nodes
is conceptually the same, though computationally less complex because fewer points
need be considered.


[Figure: three panels showing the primary split, the first surrogate split (predictive
association with primary split = 8) and the second surrogate split (predictive
association with primary split = 6).]


Of all possible splits based on a single feature, the primary split, “$x_1 < 5.5$?”, minimizes
the entropy impurity of the full training set. The first surrogate split at the root
node must use a feature other than $x_1$; its threshold is set in order to best approximate
the action of the primary split. In this case “$x_3 < 3.5$?” is the first surrogate
split. Likewise, here the second surrogate split must use the $x_2$ feature; its threshold
is chosen to best approximate the action of the primary split. In this case “$x_2 < 3.5$?”
is the second surrogate split. The pink shaded band marks those patterns sent to the
matching direction as the primary split. The number of patterns in the shading is
thus the predictive association with the primary split.


During classification, any test pattern containing feature $x_1$ would be queried using
the primary split, “$x_1 < 5.5$?” Consider though the deficient test pattern $(*, 2, 4)^t$,
where * is the missing $x_1$ feature. Since the primary split cannot be used, we turn
instead to the first surrogate split, “$x_3 < 3.5$?”, which sends this point to the right.
Likewise, the test pattern $(*, 2, *)^t$ would be queried by the second surrogate split,
“$x_2 < 3.5$?”, and sent to the left.
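
The steps of this example can be sketched in code. The ten points are those of the example, features are indexed from 0 (so index 0 is the first feature), and a missing feature is represented by `None`; the function names and the representation of a split as a (feature, threshold) pair are assumptions made for illustration. Only rules of the form "feature < threshold" are considered as surrogates.

```python
POINTS = [(0, 7, 8), (1, 8, 9), (2, 9, 0), (4, 1, 1), (5, 2, 2),   # category 1
          (3, 3, 3), (6, 0, 4), (7, 4, 5), (8, 5, 6), (9, 6, 7)]   # category 2

def predictive_association(points, primary, surrogate):
    """Number of patterns sent in the same direction by both splits."""
    (jp, tp), (js, ts) = primary, surrogate
    return sum((p[jp] < tp) == (p[js] < ts) for p in points)

def best_surrogate(points, primary, feature):
    """Return (association, threshold) of the best surrogate on `feature`."""
    values = sorted(set(p[feature] for p in points))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    scored = [(predictive_association(points, primary, (feature, t)), t)
              for t in thresholds]
    return max(scored)

def route(pattern, splits):
    """Apply the first split (primary, then surrogates) whose feature is present."""
    for feature, threshold in splits:
        if pattern[feature] is not None:
            return "left" if pattern[feature] < threshold else "right"
    raise ValueError("all queried features are missing")

primary = (0, 5.5)
splits = [primary, (2, 3.5), (1, 3.5)]   # primary, first and second surrogate
print(best_surrogate(POINTS, primary, 2), best_surrogate(POINTS, primary, 1))
print(route((None, 2, 4), splits), route((None, 2, None), splits))
```

The search recovers the two surrogate thresholds and their predictive associations of 8 and 6, and the two deficient test patterns are routed right and left, respectively.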


Sometimes the fact that an attribute is missing can be informative. For instance, 
in medical diagnosis, the fact that an attribute (such as blood sugar level) is missing 
might imply that the physician had some reason not to measure it. As such, a missing 
attribute could be represented as a new feature, and used in classification. 


8.4 Other tree methods 


Virtually all tree-based classification techniques can incorporate the fundamental tech- 
niques described above. In fact that discussion expanded beyond the core ideas in 
the earliest presentations of CART. While most tree-growing algorithms use an en- 
tropy impurity, there are many choices for stopping rules, for pruning methods and 
for the treatment of missing attributes. Here we discuss just two other popular tree 
algorithms. 


8.4.1 ID3 


ID3 received its name because it was the third in a series of identification or “ID” 
procedures. It is intended for use with nominal (unordered) inputs only. If the problem
involves real-valued variables, they are first binned into intervals, each interval being
treated as an unordered nominal attribute. Every split has a branching factor $B_j$,
where $B_j$ is the number of discrete attribute bins of the variable j chosen for splitting.
In practice these are seldom binary and thus a gain ratio impurity should be used 
(Sect. 8.3.2). Such trees have their number of levels equal to the number of input 
variables. The algorithm continues until all nodes are pure or there are no more 
variables to split on. While there is thus no pruning in standard presentations of the 
ID3 algorithm, it is straightforward to incorporate pruning along the ideas presented 
above (Computer exercise 4). 


8.4.2 C4.5 


The C4.5 algorithm, the successor and refinement of ID3, is the most popular in a 
series of “classification” tree methods. In it, real-valued variables are treated the same 
as in CART. Multi-way (B > 2) splits are used with nominal data, as in ID3 with a 
gain ratio impurity based on Eq. 7. The algorithm uses heuristics for pruning derived 
based on the statistical significance of splits. 

A clear difference between C4.5 and CART involves classifying patterns with miss- 
ing features. During training there are no special accommodations for subsequent 
classification of deficient patterns in C4.5; in particular, there are no surrogate splits 
precomputed. Instead, if node N with branching factor B queries the missing feature 
in a deficient test pattern, C4.5 follows all B possible answers to the descendent nodes 
and ultimately B leaf nodes. The final classification is based on the labels of the B 
leaf nodes, weighted by the decision probabilities at N. (These probabilities are sim- 
ply those of decisions at N on the training data.) Each of N’s immediate descendent 
nodes can be considered the root of a sub-tree implementing part of the full classifica- 
tion model. This missing-attribute scheme corresponds to weighting these sub-models 
by the probability any training pattern at N would go to the corresponding outcome 
of the decision. This method does not exploit statistical correlations between different 
features of the training points, whereas the method of surrogate splits in CART does. 
Since C4.5 does not compute surrogate splits and hence does not need to store them, 
this algorithm may be preferred over CART if space complexity (storage) is a major 
concern. 
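
The weighting scheme for deficient test patterns can be sketched as follows. This is a simplified illustration of the idea described above and not the C4.5 implementation itself; the dictionary-based tree representation, the feature names and the branch probabilities are all hypothetical.

```python
def classify(node, pattern):
    """Return {label: weight} for a pattern that may lack some features.

    A leaf is {"label": ...}. An internal node is
    {"feature": name, "branches": {value: (probability, child)}}, where
    probability is the fraction of training patterns at the node that
    took that branch.
    """
    if "label" in node:
        return {node["label"]: 1.0}
    value = pattern.get(node["feature"])
    if value is not None:
        _, child = node["branches"][value]
        return classify(child, pattern)
    # Missing feature: follow every branch and weight the resulting labels.
    votes = {}
    for prob, child in node["branches"].values():
        for label, weight in classify(child, pattern).items():
            votes[label] = votes.get(label, 0.0) + prob * weight
    return votes

# Hypothetical tree over two nominal features.
size_node = {"feature": "size", "branches": {
    "small": (0.4, {"label": "A"}), "large": (0.6, {"label": "B"})}}
tree = {"feature": "color", "branches": {
    "red": (0.6, {"label": "A"}),
    "green": (0.3, {"label": "B"}),
    "blue": (0.1, size_node)}}

print(classify(tree, {"size": "large"}))   # the color feature is missing
```

The final category would be the label with the largest accumulated weight; no information beyond the branch probabilities has to be stored at each node.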

The C4.5 algorithm has the provision for pruning based on the rules derived from 
the learned tree. Each leaf node has an associated rule — the conjunction of the 
decisions leading from the root node, through the tree, to that leaf. A technique 
called C4.5Rules deletes redundant antecedents in such rules. To understand this, 
consider the left-most leaf in the tree at the bottom of Fig. 8.6, which corresponds to 
the rule 


IF [ ($0.40x_1 + 0.16x_2 < 0.11$)
AND ($0.27x_1 - 0.44x_2 < -0.02$)
AND ($0.96x_1 - 1.77x_2 < -0.45$)
AND ($5.43x_1 - 13.33x_2 < -6.03$) ]

THEN $\mathbf{x} \in \omega_1$.


This rule can be simplified to give


IF [ ($0.40x_1 + 0.16x_2 < 0.11$)
AND ($5.43x_1 - 13.33x_2 < -6.03$) ]
THEN $\mathbf{x} \in \omega_1$,


as should be evident in that figure. Note especially that information corresponding to 
nodes near the root can be pruned by C4.5Rules. This is more general than impurity 
based pruning methods, which instead merge leaf nodes. 


8.4.3 Which tree classifier is best? 


In Chap. ?? we shall consider the problem of comparing different classifiers, including 
trees. Here, rather than directly comparing typical implementations of CART, ID3, 
C4.5 and numerous other tree methods, it is more instructive to consider variations
within the different component steps. After all, with care one can generate a tree using 
any reasonable feature processing, impurity measure, stopping criterion or pruning 
method. Many of the basic principles applicable throughout pattern classification 
guide us here. Of course, if the designer has insight into feature preprocessing, this 
should be exploited. The binning of real-valued features used in early versions of ID3 
does not take full advantage of order information, and thus ID3 should be applied 
to such data only if computational costs are otherwise too high. It has been found 
that an entropy impurity works acceptably in most cases, and is a natural default. In 
general, pruning is to be preferred over stopped splitting and cross-validation, since it
takes advantage of more of the information in the training set; however, pruning large 
training sets can be computationally expensive. The pruning of rules is less useful 
for problems that have high noise and are at base statistical in nature, but such 
pruning can often simplify classifiers for problems where the data were generated 
by rules themselves. Likewise, decision trees are poor at inferring simple concepts, 
for instance whether more than half of the binary (discrete) attributes have value 
+1. As with most classification methods, one gains expertise and insight through 
experimentation on a wide range of problems. No single tree algorithm dominates or 
is dominated by others. 


It has been found that trees yield classifiers with accuracy comparable to other 
methods we have discussed, such as neural networks and nearest-neighbor classifiers, 
especially when specific prior information about the appropriate form of classifier is 
lacking. Tree-based classifiers are particularly useful with non-metric data and as 
such they are an important tool in pattern recognition research. 


8.5 *Recognition with strings 


Suppose the patterns are represented as ordered sequences or strings of discrete items, 
as in a sequence of letters in an English word or in DNA bases in a gene sequence, 
such as “AGCTTCGAATC.” (The letters A, G, C and T stand for the nucleic acids adenine, 
guanine, cytosine and thymine.) Pattern classification based on such strings of discrete 
symbols differs in a number of ways from the more commonly used techniques we 
have addressed up to here. Because the string elements — called characters, letters 
or symbols — are nominal, there is no obvious notion of distance between strings. 
There is a further difficulty arising from the fact that strings need not be of the 
same length. While such strings are surely not vectors, we nevertheless broaden our 
familiar boldface notation to now apply to strings as well, e.g., x = “AGCTTC,” though 
we will often refer to them as patterns, strings, templates or general words. (Of course,
there is no requirement that these be meaningful words in a natural language such as
English or French.) A particularly long string is denoted text. Any contiguous string 
that is part of x is called a substring, segment, or more frequently a factor of x. For 
example, “GCT” is a factor of “AGCTTC.” 

There is a large number of problems in computations on strings. The ones that 
are of greatest importance in pattern recognition are: 


String matching: Given x and text, test whether x is a factor of text, and if so,
where it appears. 


Edit distance: Given two strings x and y, compute the minimum number of ba- 
sic operations — character insertions, deletions and exchanges — needed to 
transform x into y. 


String matching with errors: Given x and text, find the locations in text where 
the “cost” or “distance” of x to any factor of text is minimal. 


String matching with the “don’t care” symbol: This is the same as basic string 
matching, but with a special symbol, Ø, the don’t care symbol, which can match 
any other symbol. 


We should begin by understanding the several ways in which these string opera- 
tions are used in pattern classification. Basic string matching can be viewed as an 
extreme case of template matching, as in finding a particular English word within a 
large electronic corpus such as a novel or digital repository. Alternatively, suppose 
we have a large text such as Herman Melville's Moby Dick, and we want to classify 
it as either most relevant to the topic of fish or to the topic of hunting. Test strings 
or keywords for the fish topic might include “salmon,” “whale,” “fishing,” “ocean,” 
while those for hunting might include “gun,” “bullet,” “shoot,” and so on. String 
matching would determine the number of occurrences of such keywords in the text. 
A simple count of the keyword occurrences could then be used to classify the text 
according to topic. (Other, more sophisticated methods for this latter stage would 
generally be preferable.) 
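
The keyword-counting idea above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended classifier: the function names, the sample sentence and the tie-breaking behavior (the first topic with the highest count wins) are all assumptions made here.

```python
def count_occurrences(text: str, keyword: str) -> int:
    """Count occurrences of keyword in text, allowing overlaps."""
    count, start = 0, text.find(keyword)
    while start != -1:
        count += 1
        start = text.find(keyword, start + 1)
    return count


def classify_by_keywords(text: str, topics: dict[str, list[str]]) -> str:
    """Return the topic whose keywords occur most often in text."""
    scores = {
        topic: sum(count_occurrences(text, kw) for kw in keywords)
        for topic, keywords in topics.items()
    }
    return max(scores, key=scores.get)


# Keyword lists follow the example in the text; the sentence is hypothetical.
topics = {
    "fish": ["salmon", "whale", "fishing", "ocean"],
    "hunting": ["gun", "bullet", "shoot"],
}
sample = "the whale dove into the ocean as the harpooner reached for his gun"
print(classify_by_keywords(sample, topics))
```

Here the fish keywords occur twice and the hunting keywords once, so the sample is assigned to the fish topic.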

The problem of string matching with the don't care symbol is closely related 
to standard string matching, even though the best algorithms for the two types of 
problems differ, as we shall see. Suppose, for instance, that in DNA sequence analysis 
we have a segment of DNA, such as x = “AGCCGØØØØØGACTG,” where the first and last
sections (called motifs) are important for coding a protein while the middle section,
which consists of five characters, is nevertheless known to be inert and to have no
function. If we are given an extremely long DNA sequence (the text), string matching
with the don’t care symbol using the pattern x containing Ø symbols would determine
if text is in the class of sequences that could yield the particular protein. 

The string operation that finds greatest use in pattern classification is based on 
edit distance, and is best understood in terms of the nearest-neighbor algorithm 
(Chap. ??). Recall that in that algorithm each training pattern or prototype is stored 
along with its category label; an unknown test pattern is then classified by its near- 
est prototype. Suppose now that the prototypes are strings and we seek to classify 
a novel test string by its “nearest” stored string. For instance an acoustic speech 
recognizer might label every 10-ms interval with the most likely phoneme present in 
an utterance, giving a string of discrete phoneme labels such as “tttoooonn.” Edit 
distance would then be used to find the “nearest” stored training pattern, so that its 
category label can be read. 

The difficulty in this approach is that there is no obvious notion of metric or
distance between strings. In order to proceed, then, we must introduce some measure 
of distance between the strings. The resulting edit distance is the minimum number 
of fundamental operations needed to transform the test string into a prototype string, 
as we shall see. 

The string-matching-with-errors problem contains aspects of both the basic string 
matching and the edit distance problems. The goal is to find all locations in text 
where x is “close” to a substring or factor of text. This measure of closeness is
chosen to be an edit distance. Thus the string-matching-with-errors problem finds 
use in the same types of problems as basic string matching, the only difference being 
that there is a certain “tolerance” for a match. It finds use, for example, in searching 
digital texts for possibly misspelled versions of a given target word. 


Naturally, deciding which strings to consider is highly problem-dependent. Nev- 
ertheless, given target strings and the relevance of tolerances, and so on, the string 
matching problems just outlined are conceptually very simple; the challenge arises 
when the problems are large, such as searching for a segment within the roughly 
3 × 10^9 base pairs in the human genome, the 3 × 10^7 characters in an electronic version
of War and Peace or the more than 10^13 characters in a very large digital
repository. For such cases, the effort is in finding tricks and heuristics that make the 
problem computationally tractable. 

We now consider these four string operations in greater detail. 


8.5.1 String matching 


The most fundamental and useful operation in string matching is testing whether a 
candidate string x is a factor of text. Naturally we assume the number of characters 
in text, denoted length[text] or |text|, is greater than that in x, and for most computationally
interesting cases |text| ≫ |x|. Each discrete character is taken from an
alphabet A, for example binary or decimal numerals, the English letters, or four DNA 
bases, i.e., A = {0,1} or {0,1,2,...,9} or {a,b,c,...,z} or {A,G,C,T}, respec- 
tively. A shift, s, is an offset needed to align the first character of x with character 
number s + 1 in text. The basic string matching problem is to find whether there 
exists a valid shift, i.e., one where there is a perfect match between each character in 
x and the corresponding one in text. The general string-matching problem is to list 
all valid shifts (Fig. 8.7). 


text    a b a c d b d a c b b a c d a s c

x                 b d a c     (aligned at shift s = 5)


Figure 8.7: The general string-matching problem is to find all shifts s for which the 
pattern x appears in text. Any such shift is called valid. In this case x = “bdac” is 
indeed a factor of text, and s = 5 is the only valid shift. 


The most straightforward approach in string matching is to test each possible shift 
s in turn, as given in the naive string-matching algorithm. 

Algorithm 1 (Naive string matching)


1 begin initialize A, x, text, n ← length[text], m ← length[x]
2     s ← 0
3     while s ≤ n − m
4         if x[1...m] = text[s + 1...s + m]
5             then print “pattern occurs at shift” s
6         s ← s + 1
7     return
8 end
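
Algorithm 1 translates almost line for line into Python. The sketch below uses 0-based slicing, so a returned shift s means that x is aligned with characters s + 1 through s + m of text in the 1-based notation of the pseudocode. The function name is an assumption, and the text string is transcribed from Fig. 8.7 as best it can be read, so it too should be treated as an assumption.

```python
def naive_match(x: str, text: str) -> list[int]:
    """Return every valid shift s at which x occurs in text."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):      # s = 0, 1, ..., n - m
        if text[s:s + m] == x:      # compare x[1..m] with text[s+1..s+m]
            shifts.append(s)
    return shifts


print(naive_match("bdac", "abacdbdacbbacdasc"))
```

For this text the only valid shift found is s = 5.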


Algorithm 1 is hardly optimal — it takes time O((n − m + 1)m) in the worst case; if x
and text are random, however, the algorithm is efficient (Problem 18). The weakness 
in the naive string-matching algorithm is that information from one candidate shift 
s is not exploited when seeking a subsequent candidate shift. A more sophisticated 
method, the Boyer-Moore algorithm, uses such information in a clever way. 


Algorithm 2 (Boyer-Moore string matching) 


 1 begin initialize A, x, text, n ← length[text], m ← length[x]
 2     F(x) ← last-occurrence function
 3     G(x) ← good-suffix function
 4     s ← 0
 5     while s ≤ n − m
 6         do j ← m
 7         while j > 0 and x[j] = text[s + j]
 8             do j ← j − 1
 9         if j = 0
10             then print “pattern occurs at shift” s
11                 s ← s + G(0)
12             else s ← s + max[G(j), j − F(text[s + j])]
13     return
14 end


Postponing for the moment considerations of the functions F and G, we can see that 
the Boyer-Moore algorithm resembles the naive string-matching algorithm, but with 
two exceptions. First, at each candidate shift s, the character comparisons are done 
in reverse order, i.e., from right to left (line 8). Second, according to lines 11 & 12, 
the increment to a new shift apparently need not be 1. 

The power of Algorithm 2 lies in two heuristics that allow it to skip the examination 
of a large number of shifts and hence character comparisons: the good-suffix heuristic and
the bad-character heuristic operate independently and in parallel. After a mismatch 
is detected, each heuristic proposes an amount by which s can be safely increased 
without missing a valid shift; the larger of these proposed shifts is selected and s is 
increased accordingly. 

The bad-character heuristic utilizes the rightmost character in text that does not 
match the aligned character in x. Because character comparisons proceed right-to- 
left, this “bad character” is found as efficiently as possible. Since the current shift s is 
invalid, no more character comparisons are needed and a shift increment can be made. 
The bad-character heuristic proposes incrementing the shift by an amount to align 
the rightmost occurrence of the bad character in x with the bad character identified 
in text. This guarantees that no valid shifts have been skipped (Fig. 8.8). 

Figure 8.8: String matching by the Boyer-Moore algorithm takes advantage of information
obtained at one shift s to propose the next shift; the algorithm is generally
much less computationally expensive than naive string matching, which always increments
shifts by a single character. The top figure shows the alignment of text and
pattern x for an invalid shift s. Character comparisons proceed right to left, and
the first two such comparisons are a match; the good suffix is “es.” The first
(right-most) mismatched character in text, here “i,” is called the bad character. The
bad-character heuristic proposes incrementing the shift to align the right-most “i”
in x with the bad character “i” in text — a shift increment of 3, as shown in the 
middle figure. The bottom figure shows the effect of the good-suffix heuristic, which 
proposes incrementing the shift the least amount that will align the good suffix, “es” 
in x, with that in text — here an increment of 7. Lines 11 & 12 of the Boyer-Moore 
algorithm select the larger of the two proposed shift increments, i.e., 7 in this case. 
Although not shown in this figure, after the mismatch is detected at shift s +7, both 
the bad-character and the good-suffix heuristics propose an increment of yet another 
7 characters, thereby finding a valid shift. 


Now consider the good-suffix heuristic, which operates in parallel with the bad-character
heuristic, and also proposes a safe shift increment. A general suffix of x is
a factor or substring of x that contains the final character in x. (Likewise, a prefix 
contains the initial character in x.) At shift s the rightmost contiguous characters in 
text that match those in x are called the good suffix, or “matching suffix.” As before, 
because character comparisons are made right-to-left, the good suffix is found with 
the minimum number of comparisons. Once a character mismatch has been found, the 
good-suffix heuristic proposes to increment the shift so as to align the next occurrence 
of the good suffix in x with that identified in text. This ensures that no valid shift has
been skipped. Given the two shift increments proposed by the two heuristics, line 12
of the Boyer-Moore algorithm chooses the larger.

These heuristics rely on the functions F and G. The last-occurrence function, 
F(x), is merely a table containing every letter in the alphabet and the position of its 
rightmost occurrence in x. For the pattern in Fig. 8.8, the table would contain: a, 6;
e, 8; i, 4; m, 5; s, 9; and t, 7. All 20 other letters in the English alphabet are assigned
a value 0, signifying that they do not appear in x. The construction of this table is
simple (Problem 22) and need be done just once; it does not significantly affect the
computational cost of the Boyer-Moore algorithm.
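
A minimal sketch of how such a last-occurrence table might be built is given here. The function name and the choice of the lowercase English alphabet as default are assumptions; positions are 1-based to agree with the text, and 0 marks a letter that does not occur in x.

```python
import string


def last_occurrence(x: str, alphabet: str = string.ascii_lowercase) -> dict[str, int]:
    """Map each letter to the 1-based position of its rightmost occurrence in x."""
    table = {ch: 0 for ch in alphabet}      # 0 means "does not appear in x"
    for pos, ch in enumerate(x, start=1):
        table[ch] = pos                     # later positions overwrite earlier ones
    return table


F = last_occurrence("estimates")
print({ch: pos for ch, pos in F.items() if pos > 0})
```

For “estimates” the six letters that occur receive their rightmost positions (the two occurrences of t are at positions 3 and 7, so t maps to 7), and the remaining 20 letters map to 0. A single left-to-right pass over x suffices, which is why the table adds little to the overall cost.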

The good-suffix function, G(x), creates a table which for each suffix gives the 
location of its other occurrences in x. In the example in Fig. 8.8, the suffix s (the 
last character in “estimates” ) also occurs at position 2 in x. Further, the suffix “es” 
occurs at position 1 in x. The suffix “tes” does not appear elsewhere in x and hence 
it, and all other suffixes, are assigned the value 0. In sum, then, the table of G(x) 
would have just two non-zero entries: s, 2 and es, 1. 

In practice, these heuristics make the Boyer-Moore algorithm one of the most attractive
string-matching algorithms on serial computers. Other powerful methods quickly be- 
come conceptually more involved and are generally based on precomputing functions 
of x that enable efficient shift increments, or dividing the problem for efficient parallel 
computation. 
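
To make the shift logic concrete, the following sketch implements a simplified variant of Algorithm 2 that uses only the bad-character heuristic; the good-suffix function G is deliberately omitted, so after a full match the shift advances by a single character. This is a simplification assumed here, not the full Boyer-Moore algorithm, and the function name and the example text are likewise illustrative assumptions.

```python
def boyer_moore_bad_char(x: str, text: str) -> list[int]:
    """Find all valid shifts using right-to-left comparison and the
    bad-character heuristic only (no good-suffix heuristic)."""
    n, m = len(text), len(x)
    if m == 0 or m > n:
        return []
    # F: rightmost 1-based position of each character of x (absent -> 0)
    last = {ch: pos for pos, ch in enumerate(x, start=1)}
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        # compare right to left, as in lines 7-8 of Algorithm 2
        while j > 0 and x[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += 1                              # conservative step after a match
        else:
            bad = text[s + j - 1]               # the bad character in text
            s += max(1, j - last.get(bad, 0))   # never move the pattern backward
    return shifts


print(boyer_moore_bad_char("estimates", "probabilities_for_estimates"))
```

On this example the pattern is found at shift 18 after only four alignments are examined, whereas the naive algorithm would examine nineteen. Taking the maximum with 1 guards against the case in which the rightmost occurrence of the bad character in x lies to the right of the mismatch, where the raw proposal j − F would be zero or negative.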

Many applications require a text to be searched for several strings, as in the case 
of keyword search through a digital text. Occasionally, some of these search strings 
are themselves factors of other search strings. Presumably we would not want to 
acknowledge a match of a short string if it were also part of a match for a longer string. 
Thus if our keywords included “beat,” “eat,” and “be,” we would want our search to 
return only the string match of “beat” from text = “when_chris_beats_the_drum,” 
not the shorter strings “eat” and “be,” which are nevertheless “there” in text. This is 
an example of the subset-superset problem. Although there may be much bookkeeping 
associated with imposing such a strict bias for longer sequences over shorter ones, the 
approach is conceptually straightforward (Computer exercise 9). 
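
One simple way to impose the preference for longer keywords is to collect every match and then discard any match whose span lies inside the span of a longer match. The sketch below takes that approach; the function name and the (start, end, keyword) representation are assumptions made for illustration, and no attempt is made at efficiency.

```python
def longest_keyword_matches(text: str, keywords: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, keyword) matches, dropping any match that is
    contained within a longer keyword match."""
    spans = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            spans.append((start, start + len(kw), kw))
            start = text.find(kw, start + 1)
    kept = [
        (a, b, kw)
        for (a, b, kw) in spans
        # drop (a, b) if some strictly longer span (c, d) covers it
        if not any(c <= a and b <= d and (d - c) > (b - a) for (c, d, _) in spans)
    ]
    return sorted(kept)


print(longest_keyword_matches("when_chris_beats_the_drum", ["beat", "eat", "be"]))
```

For the example in the text only the match of “beat” (characters 11 through 14, counting from 0) survives; the matches of “eat” and “be” are suppressed because they fall inside it.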


8.5.2 Edit distance 


The fundamental idea underlying pattern recognition using edit distance is based on 
the nearest-neighbor algorithm (Chap. ??). We store a full training set of strings 
and their associated category labels. During classification, a test string is compared 
to each stored string and a “distance” or score is computed; the test string is then 
assigned the category label of the “nearest” string in the training set. 
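
The classification scheme just described can be sketched as follows. The distance used is the unit-cost edit distance developed in the remainder of this section, written here as a compact memoized recursion purely for brevity. The prototype strings and their category labels are hypothetical, chosen to resemble the phoneme-label strings mentioned earlier.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance between x and y (memoized recursion)."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return i + j        # only insertions or deletions remain
        return min(
            d(i - 1, j) + 1,
            d(i, j - 1) + 1,
            d(i - 1, j - 1) + (x[i - 1] != y[j - 1]),   # substitution or no change
        )
    return d(len(x), len(y))


def nearest_neighbor_label(test: str, prototypes: list[tuple[str, str]]) -> str:
    """Return the category label of the stored string nearest to test."""
    nearest = min(prototypes, key=lambda p: edit_distance(test, p[0]))
    return nearest[1]


# Hypothetical training set of (string, category label) pairs.
prototypes = [("ttooonn", "tone"), ("ttaaann", "tan"), ("ssooonn", "sewn")]
print(nearest_neighbor_label("tttoooonn", prototypes))
```

The test string is two insertions away from the first prototype and farther from the other two, so it receives the label of the first.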

Unlike the case using real-valued vectors discussed in Chap. ??, there is no single 
obvious measure of the similarity or difference between two strings. For instance, it 
is not clear whether “abbccc” is closer to “aabbcc” or to “abbcccb.” To proceed, 
then, we introduce a measure of the difference between two strings. Such an edit 
distance between x and y describes how many fundamental operations are required 
to transform x into y. These fundamental operations are: 


substitutions: A character in x is replaced by the corresponding character in y. 


insertions: A character in y is inserted into x, thereby increasing the length of x by 
one character. 


deletions: A character in x is deleted, thereby decreasing the length of x by one 
character. 

Occasionally we also consider a fourth operation, interchange, or “twiddle,” or transposition,
which interchanges two neighboring characters in x. Thus, one could transform
x = “asp” into y = “sap” with a single interchange. Because such an interchange
can always be expressed as two substitutions, for simplicity we shall not consider
interchanges.

Let C be an m × n matrix of integers associated with a cost or “distance” and let
δ(·,·) denote a generalization of the Kronecker delta function, having value 1 if the
two arguments (characters) match and 0 otherwise. The basic edit-distance algorithm
is then:

Algorithm 3 (Edit distance)


 1 begin initialize A, x, y, m ← length[x], n ← length[y]
 2     C[0, 0] ← 0
 3     i ← 0
 4     do i ← i + 1
 5         C[i, 0] ← i
 6     until i = m
 7     j ← 0
 8     do j ← j + 1
 9         C[0, j] ← j
10     until j = n
11     i ← 0; j ← 0
12     do i ← i + 1; j ← 0
13         do j ← j + 1
14             C[i, j] = min[ C[i − 1, j] + 1, C[i, j − 1] + 1, C[i − 1, j − 1] + 1 − δ(x[i], y[j]) ]
                              (insertion)      (deletion)       (no change/exchange)
15         until j = n
16     until i = m
17     return C[m, n]
18 end


Lines 4 — 10 initialize the left column and top row of C with the integer number 
of “steps” away from i = 0, j = 0. The core of this algorithm, line 14, finds the
minimum cost in each entry of C, column by column (Fig. 8.9). Algorithm 3 is 
thus greedy in that each column of the distance or cost matrix is filled using merely 
the costs in the previous column. Linear programming techniques can also be used 
to find a global minimum, though this nearly always requires greater computational 
effort (Problem 27). 

If insertions and deletions are equally costly, then the symmetry property of a 
metric holds. However, we can broaden the applicability of the algorithm by allowing 
in line 14 different costs for the fundamental operations; for example insertions might 
cost twice as much as substitutions. In such a broader case, properties of symmetry 
and the triangle inequality no longer hold and edit distance is not a true metric 
(Problem 28). 
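
The loss of symmetry under unequal costs is easy to demonstrate. The sketch below generalizes the recurrence of line 14 to separate insertion, deletion and substitution costs. The parameter names are assumptions, as is the convention adopted for which term is which: stepping from C[i − 1, j] is taken to delete a character of x, and stepping from C[i, j − 1] to insert a character of y.

```python
def weighted_edit_distance(x: str, y: str, ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Minimum total cost of transforming x into y with per-operation costs."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i * dele                  # delete all of x[1..i]
    for j in range(1, n + 1):
        C[0][j] = j * ins                   # insert all of y[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if x[i - 1] == y[j - 1] else sub
            C[i][j] = min(
                C[i - 1][j] + dele,         # delete x[i]
                C[i][j - 1] + ins,          # insert y[j]
                C[i - 1][j - 1] + change,   # substitute, or no change
            )
    return C[m][n]


print(weighted_edit_distance("ab", "abc", ins=2))   # one insertion
print(weighted_edit_distance("abc", "ab", ins=2))   # one deletion
```

With insertions costing twice as much as the other operations, transforming “ab” into “abc” costs 2 while the reverse transformation costs 1, so the measure is no longer symmetric.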

As shown in Fig. 8.9, x = “excused” can be transformed to y = “exhausted” 
through one substitution and two insertions. The table shows the steps of this trans- 
formation, along with the computed entries of the cost matrix C. For the case shown, 
where each fundamental operation has a cost of 1, the edit distance is given by the 
value of the cost matrix at the sink, i.e., C[7,9] = 3. 
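
A direct Python rendering of Algorithm 3 with unit costs is sketched below. The function name is an assumption; the matrix is allocated with m + 1 rows and n + 1 columns so that the indices 0 through m and 0 through n used in the pseudocode are all available.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, filling the full cost matrix C."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                         # lines 3-6: first column
    for j in range(1, n + 1):
        C[0][j] = j                         # lines 7-10: first row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0    # delta(x[i], y[j])
            C[i][j] = min(
                C[i - 1][j] + 1,
                C[i][j - 1] + 1,
                C[i - 1][j - 1] + 1 - match,
            )                               # line 14
    return C[m][n]


print(edit_distance("excused", "exhausted"))
```

The call returns 3, in agreement with the value at the sink quoted above. A distance of 2 is impossible here: the strings differ in length by two, so two insertions are required, and because “c” does not occur in “exhausted” at least one further operation is needed.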




[Figure 8.9 legend: deletion (remove letter of x); insertion (insert letter of y into x);
exchange (replace letter of x by letter of y); no change. The path through the table
runs from the source to the sink.]


Figure 8.9: The edit distance calculation for strings x and y can be illustrated in 
a table. Algorithm 3 begins at source, i = 0, j = 0, and fills in the cost matrix C,
column by column (shown in red), until the full edit distance is placed at the sink,
C[i = m, j = n]. The edit distance between “excused” and “exhausted” is thus 3.


x   excused      source string         C[0,0] = 0
    exhused      substitute h for c    C[3,3] = 1
    exhaused     insert a              C[3,4] = 2
    exhausted    insert t              C[5,7] = 3
y   exhausted    target string         C[7,9] = 3


8.5.3 Computational complexity 


Algorithm 3 is O(mn) in time, of course; it is O(m) in space (memory) since only the 
entries in the previous column need be stored when computing C[i, j] for i = 0 to m.
Because of the importance of string matching and edit distance throughout computer 
science, a number of algorithms have been proposed. We need not delve into the 
details here (but see the Bibliography) except to say that there are sophisticated 
string-matching algorithms with time complexity O(m + n). 
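
The O(m) space bound can be seen in a version of the edit-distance computation that keeps only two columns of C at a time. The sketch below assumes unit costs; the names prev and curr for the previous and current columns are illustrative.

```python
def edit_distance_two_columns(x: str, y: str) -> int:
    """Unit-cost edit distance storing only the previous and current columns."""
    m, n = len(x), len(y)
    prev = list(range(m + 1))           # column j = 0: C[i, 0] = i
    for j in range(1, n + 1):
        curr = [j] + [0] * m            # C[0, j] = j
        for i in range(1, m + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0
            curr[i] = min(
                curr[i - 1] + 1,        # C[i - 1, j] + 1, same column
                prev[i] + 1,            # C[i, j - 1] + 1, previous column
                prev[i - 1] + 1 - match,
            )
        prev = curr                     # the older column is no longer needed
    return prev[m]


print(edit_distance_two_columns("excused", "exhausted"))
```

The time remains O(mn), since every entry of C is still computed once, but only 2(m + 1) integers are held in memory at any moment. The price is that the sequence of operations can no longer be read back from the table, only the final distance.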


8.5.4 String matching with errors 


There are several versions of the string-matching-with-errors problem; the one that 
concerns us is this: given a pattern x and text, find the shift for which the edit 
distance between x and a factor of text is minimum. The algorithm for the string- 
matching-with-errors problem is very similar to that for edit distance. Let E be a 
matrix of costs, analogous to C in Algorithm 3. We seek a shift for which the edit 
distance to a factor of text is minimum, or formally min[C(x, y)] where y is any factor 
of text. To this end, the algorithm must compute its new cost E whose entries are 
E[i, j] = min[C(x[1...i], y[1...j])].

The principal difference between the algorithms for the two problems (i.e., with 
or without errors) is that we initialize E[0,j] to 0 in the string matching with errors 
problem, instead of to j in lines 4 – 10 of the edit-distance algorithm (Algorithm 3). This
initialization of E expresses the fact that the “empty” prefix of x matches an empty 
factor of text, and contributes no cost. 

Figure 8.10: The string matching with errors problem is to find the shift s for which the
edit distance between x and an aligned factor of text is minimum. In this illustration,
the minimum edit distance is 1, corresponding to the character exchange u → i, and
the shift s = 11 is the location.


Two minor heuristics for reducing computational effort are relevant to the
string-matching-with-errors problem. The first is that except in highly unusual cases, the
lengths of the candidate factors of text that need be considered are roughly equal
to length[x]. Second, for each candidate shift, the edit-distance calculation can be
terminated if it already exceeds the current minimum. In practice, this latter heuris- 
tic can reduce the computational burden significantly. Otherwise, the algorithm for 
string matching with errors is virtually the same as that for edit distance (Computer 
exercise 10). 
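
The modified initialization can be sketched as follows, assuming unit costs. Setting the first row of E to zero lets a match begin at any position of text, and the smallest entry in the last row of E then gives the minimum edit distance between x and any factor of text. The function name, the decision to report the end position of the best factor rather than its shift, and the example strings are all assumptions made here.

```python
def best_approximate_match(x: str, text: str) -> tuple[int, int]:
    """Return (cost, end): the minimum edit distance between x and any factor
    of text, and the number of text characters consumed where it is first reached."""
    m, n = len(x), len(text)
    prev = list(range(m + 1))           # E[i, 0] = i
    best_cost, best_end = prev[m], 0
    for j in range(1, n + 1):
        curr = [0] * (m + 1)            # E[0, j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            match = 1 if x[i - 1] == text[j - 1] else 0
            curr[i] = min(curr[i - 1] + 1, prev[i] + 1, prev[i - 1] + 1 - match)
        if curr[m] < best_cost:         # keep the first position with lowest cost
            best_cost, best_end = curr[m], j
        prev = curr
    return best_cost, best_end


cost, end = best_approximate_match("heurestic", "a_simple_heuristic_helps")
print(cost, end)
```

The misspelled pattern is one substitution away from the factor “heuristic,” which ends after the 18th character of the text, so the call yields a cost of 1 and an end position of 18.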


8.5.5 String matching with the “don’t-care” symbol 


String matching with the “don’t-care” symbol, Ø, is formally the same as basic string 
matching, but the Ø in either x or text is said to match any character (Fig. 8.11).


Figure 8.11: String matching with don’t care symbol is the same as basic string
matching except the Ø symbol — in either text or x — matches any character. The 
figure shows the only valid shift. 


An obvious approach to string matching with the don't care symbol is to modify 
the naive string-matching algorithm to include a condition for matching the don’t 
care symbol. Such an approach, however, retains the computational inefficiencies of 
naive string matching (Problem 29). Further, extending the Boyer-Moore algorithm 
to include Ø is somewhat difficult and inefficient. The most effective methods are 
based on fundamental methods in computer arithmetic and, while fascinating, would 
take us away from our central concerns of pattern recognition (cf. Bibliography). The 
use of this technique in pattern recognition is the same as string matching, with a 
particular type of “tolerance.” 
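
The “obvious approach” mentioned above, a naive matcher extended with a test for the don’t-care symbol, can be sketched in a few lines. It inherits the inefficiency of naive matching and is shown only to make the matching rule concrete. The function name and the DNA text are hypothetical; the pattern is the motif example given earlier in this section.

```python
DONT_CARE = "Ø"


def match_with_dont_care(x: str, text: str, wildcard: str = DONT_CARE) -> list[int]:
    """Return all shifts at which x matches text, where the wildcard
    (in either string) matches any character."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):
        if all(a == b or a == wildcard or b == wildcard
               for a, b in zip(x, text[s:s + m])):
            shifts.append(s)
    return shifts


motif = "AGCCG" + DONT_CARE * 5 + "GACTG"   # two motifs around an inert section
dna = "TTAGCCGACGTAGACTGCC"                 # hypothetical sequence
print(match_with_dont_care(motif, dna))
```

The pattern matches at shift 2, where both motifs align exactly and the five intervening characters are ignored.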


While learning is a general and fundamental technique throughout pattern
recognition, it has found limited use in recognition with basic string matching. This is
because the designer typically knows precisely which strings are being sought — they 
do not need to be learned. Learning can, of course, be based on the outputs of a 
string-matching algorithm, as part of a larger pattern recognition system. 


8.6 Grammatical methods 


Up to here, we have not been concerned with any detailed models that might underlie
the generation of the sequence of characters in a string. We now turn to the case 
where rules of a particular sort were used to generate the strings and thus where their 
structure is fundamental. Often this structure is hierarchical, where at the highest 
or most abstract level a sequence is very simple, but at subsequent levels there is 
greater and greater complexity. For instance, at its most abstract level, the string 
“The history book clearly describes several wars” is merely a sentence. At 
a somewhat more detailed level it can be described as a noun phrase followed by a 
verb phrase. The noun phrase can be expanded at yet a subsequent level, as can the 
verb phrase. The expansion ends when we reach the words “The,” “history,” and 
so forth — items that are considered the “characters,” atomic and without further 
structure. Consider too strings representing valid telephone numbers — local, national 
and international. Such numbers conform to a strict structure: either a country code 
is present or it is not; if not, then the domestic national code may or may not be 
present; if a country code is present, then there is a set of permissible city codes and 
for each city there is a set of permissible area codes and individual local numbers, and 
so on. 


As we shall see, such structure is easily specified in a grammar, and when such 
structure is present the use of a grammar for recognition can improve accuracy. For in- 
stance, grammatical methods can be used to provide constraints for a full system that 
uses a statistical recognizer as a component. Consider an optical character recogni- 
tion system that recognizes and interprets mathematical equations based on a scanned 
pixel image. The mathematical symbols often have specific “slots” that can be filled 
with certain other symbols; this can be specified by a grammar. Thus an integral sign 
has two slots, for upper and lower limits, and these can be filled by only a limited set 
of symbols. (Indeed, a grammar is used in many mathematical typesetting programs 
in order to prevent authors from creating meaningless “equations.”) A full system 
that recognizes the integral sign could use a grammar to limit the number of candi- 
date categories for a particular slot, and this increases the accuracy of the full system. 
Similarly, consider the problem of recognizing phone numbers within acoustic speech 
in an automatic dialing application. A statistical or Hidden-Markov-Model acoustic 
recognizer might perform word spotting and pick out number words such as “eight” 
and “hundred.” A subsequent stage based on a formal grammar would then exploit 
the fact that telephone numbers are highly constrained, as mentioned. 

We shall study the case when crisp rules specify how the representation at one 
level leads to a more expanded and complicated representation at the next level. We 
sometimes call a string generated by a set of rules a sentence; the rules are specified 
by a grammar, denoted G. (Naturally, there is no requirement that these be related 
in any way to sentences in natural language such as English.) In pattern recognition, 
we are given a sentence and a grammar, and seek to determine whether the sentence 
was generated by G. 


8.6.1 Grammars


The notion of a grammar is very general and powerful. Formally, a grammar G 
consists of four components: 


symbols: Every sentence consists of a string of characters (which are also called 
primitive symbols, terminal symbols or letters), taken from an alphabet A. For 
bookkeeping, it is also convenient to include the null or empty string denoted ε,
which has length zero; if ε is appended to any string x, the result is again x.


variables: These are also called non-terminal symbols, intermediate symbols or occasionally
internal symbols, and are taken from a set ℐ.


root symbol: The root symbol or starting symbol is a special internal symbol, the 
source from which all sequences are derived. The root symbol is taken from a 
set S. 


productions: The set of production rules, rewrite rules, or simply rules, denoted P, 
specify how to transform a set of variables and symbols into other variables and 
symbols. These rules determine the core structures that can be produced by the 
grammar. For instance if A is an internal symbol and c a terminal symbol, the 
rewrite rule cA → cc means that any time the segment cA appears in a string,
it can be replaced by cc. 


Thus we denote a general grammar by its alphabet, its variables, its particular root 
symbol, and the rewrite rules: G = (A, ℐ, S, P). The language generated by a grammar,
denoted L(G), is the set of all strings (possibly infinite in number) that can be
generated by G. 
Consider two examples; the first is quite simple and abstract. Let A = {a, b, c},
S = S, ℐ = {A, B, C}, and

P = { p1: S → aSBA OR aBA    p2: AB → BA
      p3: bB → bb            p4: bA → bc
      p5: cA → cc            p6: aB → ab }

(In order to make the list of rewrite rules more compact, we shall condense rules 
having the same left hand side by means of the OR on the right hand side. Thus rule 
p1 is a condensation of the two rules S → aSBA and S → aBA.) If we start with S
and apply the rewrite rules in the following orders, we have the following two cases: 


root  S          root  S
p1    aBA        p1    aSBA
p6    abA        p1    aaBABA
p4    abc        p6    aabABA
                 p2    aabBAA
                 p3    aabbAA
                 p4    aabbcA
                 p5    aabbcc


After the rewrite rules have been applied in these sequences, no more symbols match 
the left-hand side of any rewrite rule, and the process is complete. Such a trans- 
formation from the root symbol to a final string is called a production. These two 
productions show that abc and aabbcc are in the language generated by G. In fact, 
it can be shown (Problem 38) that this grammar generates the language L(G) = 
{a^n b^n c^n | n ≥ 1}.
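
The two productions can be replayed mechanically. In the sketch below each rewrite rule is stored as a (left-hand side, right-hand side) pair and applied to the leftmost place where its left-hand side occurs; the two alternatives of p1 are given the hypothetical names p1a and p1b, and the function name is likewise an assumption.

```python
RULES = {
    "p1a": ("S", "aSBA"),
    "p1b": ("S", "aBA"),
    "p2": ("AB", "BA"),
    "p3": ("bB", "bb"),
    "p4": ("bA", "bc"),
    "p5": ("cA", "cc"),
    "p6": ("aB", "ab"),
}


def derive(sequence: list[str], start: str = "S") -> list[str]:
    """Apply the named rules in order, returning every intermediate string."""
    current = start
    steps = [current]
    for name in sequence:
        lhs, rhs = RULES[name]
        if lhs not in current:
            raise ValueError(f"rule {name} does not apply to {current!r}")
        current = current.replace(lhs, rhs, 1)   # rewrite the leftmost occurrence
        steps.append(current)
    return steps


print(derive(["p1b", "p6", "p4"]))
print(derive(["p1a", "p1b", "p6", "p2", "p3", "p4", "p5"]))
```

The first call ends in abc and the second in aabbcc, reproducing the two derivations step by step. Rewriting the leftmost occurrence is a choice made for this sketch; a grammar in general permits a rule to be applied wherever its left-hand side appears.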

A much more complicated grammar underlies the English language, of course. The
alphabet consists of all English words, A = {the, history, book, sold, over, 1000,
copies, ...}, and the intermediate symbols are the parts of speech: ℐ = {<noun>,
<verb>, <noun phrase>, <verb phrase>, <adjective>, <adverb>, <adverbial phrase>}.
The root symbol here is S = <sentence>. A restricted set of the production rules in
English includes: 

P = { <sentence> → <noun phrase> <verb phrase>
      <noun phrase> → <adjective> <noun phrase>
      <verb phrase> → <verb phrase> <adverbial phrase>
      <noun> → book OR theorem OR ...
      <verb> → describes OR buys OR holds OR ...
      <adverb> → over OR ... }

This subset of the rules of English grammar does not prevent the generation of mean- 
ingless sentences, of course. For instance, the nonsense sentence “Squishy green 
dreams hop heuristically” can be derived in this subset of English grammar. Fig- 
ure 8.12 shows the steps of a production in a derivation tree, where the root symbol 
is displayed at the top and the terminal symbols at the bottom. 


<sentence>
    <noun phrase>
        <adjective> The
        <noun phrase>
            <adjective> history
            <noun phrase>
                <noun> book
    <verb phrase>
        <verb> sold
        <adverbial phrase>
            <preposition> over
            <noun phrase>
                <adjective> 1000
                <noun phrase>
                    <noun> copies


Figure 8.12: This derivation tree illustrates how a portion of English grammar can 
transform the root symbol, here <sentence>, into a particular sentence or string of
elements, here English words, which are read from left to right. 


8.6.2 Types of string grammars 


There are four main types of grammar, arising from different types of structure in the 
productions. As we have seen, a rewrite rule is of the form α → β, where α and β are
strings made up of intermediate and terminal symbols. 


Type 0: Free or unrestricted Free grammars have no restrictions on the rewrite 
rules and thus they provide no constraints or structure on the strings they can 
produce. While in principle they can express an arbitrary set of rules, this
generality comes at the tremendous expense of possibly unbounded learning 
time. Knowing that a string is derived from a type 0 grammar provides no 
information and as such, type 0 grammars in general have but little use in 
pattern recognition. 


Type 1: Context-sensitive A grammar is called context-sensitive if every rewrite 
rule is of the form 


αIβ → αxβ


where α and β are any strings made up of intermediate and terminal symbols,
I is an intermediate symbol and x is an intermediate or terminal symbol (other
than ε). We say that “I can be rewritten as x in the context of α on the left
and β on the right.”


Type 2: Context-free A grammar is called context free if every production is of 
the form 


I → x


where I is an intermediate symbol and x an intermediate or terminal symbol
(other than ε). Clearly, unlike a type 1 grammar, here there is no need for a
“context” for the rewriting of I by x.


Type 3: Finite State or Regular A grammar is called regular if every rewrite rule 
is of the form 


α → zβ OR α → z


where α and β are made up of intermediate symbols and z is a terminal symbol
(other than ε). Such grammars are also called finite state because they can be
generated by a finite state machine, which we shall see in Fig. 8.16.


A language generated by a grammar of type i is called a type i language. It can be
shown that the class of grammars of type i includes all grammars of type i + 1; thus 
there is a strict hierarchy in grammars. 

Any context-free grammar can be converted into one in Chomsky normal form
(CNF). Such a grammar has all rules of the form

A → BC and A → z

where A, B and C are intermediate symbols (that is, they are in ℐ) and z is a terminal
symbol. For every context-free grammar G, there is another G′ in Chomsky normal
form such that L(G) = L(G′) (Problem 36).


Example 3: A grammar for pronouncing numbers 


In order to understand these issues better, consider a grammar that yields the 
pronunciation of any number between 1 and 999,999. The alphabet has 29 basic terminal 
symbols, i.e., the spoken words 
$\mathcal{A}$ = {one, two, ..., ten, eleven, ..., twenty, thirty, ..., ninety, hundred, thousand}. 
There are six non-terminal symbols, corresponding to general six-digit, three-digit, 
and two-digit numbers, the numbers between ten and nineteen, and so forth, as will 
be clear below: 
$\mathcal{I}$ = {digits6, digits3, digits2, digit1, teens, tys}. 
The root node corresponds to a general number up to six digits in length: 
$\mathcal{S}$ = digits6. 
The set of rewrite rules is based on a knowledge of English: 
digits6 $\rightarrow$ digits3 thousand digits3 
digits6 $\rightarrow$ digits3 thousand OR digits3 
digits3 $\rightarrow$ digit1 hundred digits2 
digits3 $\rightarrow$ digit1 hundred OR digits2 
digits2 $\rightarrow$ teens OR tys OR tys digit1 OR digit1 
digit1 $\rightarrow$ one OR two OR ... OR nine 
teens $\rightarrow$ ten OR eleven OR ... OR nineteen 
tys $\rightarrow$ twenty OR thirty OR ... OR ninety 
The grammar takes digits6 and applies the productions until the elements in the 
final alphabet are produced, as shown in the figure. Because it contains rewrite rules 
such as digits6 $\rightarrow$ digits3 thousand, this grammar cannot be type 3. It is easy to 
confirm that it is type 2. 


[Figure: two derivation trees rooted at digits6. Left: digits6 $\rightarrow$ digits3 thousand digits3, yielding “six hundred thirty nine thousand fourteen” (639,014). Right: digits6 $\rightarrow$ digits3 thousand digits3, yielding “two thousand nine hundred fifty three” (2,953).] 


These two derivation trees show how the grammar G yields the pronunciation of 
639,014 and 2,953. The final string of terminal symbols is read from left to right. 
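
The productions of this example can be sketched in Python. This is only an illustration under stated assumptions: the function names mirror the intermediate symbols of the grammar, but the word lists and the way an integer selects which alternative of each rule to apply are additions made here so that the derivation is deterministic; they are not part of the grammar itself.

```python
# Terminal symbols: the 29 spoken words of the alphabet (with "hundred", "thousand").
DIGIT1 = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TYS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def digit1(n):
    # digit1 -> one OR two OR ... OR nine
    return [DIGIT1[n - 1]]


def digits2(n):
    # digits2 -> teens OR tys OR tys digit1 OR digit1
    if n < 10:
        return digit1(n)
    if n < 20:
        return [TEENS[n - 10]]
    tens, ones = divmod(n, 10)
    return [TYS[tens - 2]] + (digit1(ones) if ones else [])


def digits3(n):
    # digits3 -> digit1 hundred digits2 OR digit1 hundred OR digits2
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return digits2(rest)
    return digit1(hundreds) + ["hundred"] + (digits2(rest) if rest else [])


def digits6(n):
    # digits6 -> digits3 thousand digits3 OR digits3 thousand OR digits3
    if not 1 <= n <= 999_999:
        raise ValueError("the grammar covers 1 to 999,999 only")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return digits3(rest)
    return digits3(thousands) + ["thousand"] + (digits3(rest) if rest else [])


def pronounce(n):
    """Read the terminal symbols of the derivation from left to right."""
    return " ".join(digits6(n))


print(pronounce(639014))
print(pronounce(2953))
```

Each function expands one intermediate symbol, so the call tree of `pronounce` has the same shape as the corresponding derivation tree.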


8.6.3 Recognition using grammars 


Recognition using grammars is formally very similar to the general approaches used 
throughout pattern recognition. Suppose we suspect that a test sentence was generated 
by one of $c$ different grammars, $G_1, G_2, \ldots, G_c$, which can be considered as 
different models or classes. A test sentence x is classified according to which grammar 
could have produced it, or equivalently, the language $\mathcal{L}(G_i)$ of which x is a 
member. 

Up to now we have worked forward, forming a derivation from a root node to 
a final sentence. For recognition, though, we must employ the inverse process: that 
is, given a particular x, find a derivation in G that leads to x. This process, called 
parsing, is virtually always much more difficult than forming a derivation. We now 
discuss one general approach to parsing, and briefly mention two others. 


Bottom-up parsing 


Bottom-up parsing starts with the test sentence x, and seeks to simplify it, so as to 
represent it as the root symbol. The basic approach is to use candidate productions 
from P “backwards,” i.e., find rewrite rules whose right hand side matches part of 
the current string, and replace that part with a segment that could have produced it. 
This is the general method in the Cocke-Younger-Kasami algorithm, which fills a parse 
table from the “bottom up.” The grammar must be expressed in Chomsky normal 
form and thus the productions P must all be of the form $A \rightarrow BC$ or $A \rightarrow z$, a broad but 
not all inclusive category of grammars. Entries in the table are candidate strings in a 
portion of a valid derivation. If the top cell of the table contains the source symbol S, then indeed 
we can work forward from S and derive the test sentence, and hence $x \in \mathcal{L}(G)$. In 
the following, $x_i$ (for $i = 1, \ldots, n$) represents the individual terminal characters in the 
string to be parsed. 


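Algorithm 4 (Bottom-up parsing) 
1 begin initialize $G = (\mathcal{A}, \mathcal{I}, \mathcal{S}, \mathcal{P})$, $x = x_1 x_2 \ldots x_n$ 
2   $i \leftarrow 0$ 
3   do $i \leftarrow i + 1$ 
4     $V_{i1} \leftarrow \{A \mid A \rightarrow x_i\}$ 
5   until $i = n$ 
6   $j \leftarrow 1$ 
7   do $j \leftarrow j + 1$ 
8     $i \leftarrow 0$ 
9     do $i \leftarrow i + 1$ 
10      $V_{ij} \leftarrow \emptyset$ 
11      $k \leftarrow 0$ 
12      do $k \leftarrow k + 1$ 
13        $V_{ij} \leftarrow V_{ij} \cup \{A \mid A \rightarrow BC \in \mathcal{P},\ B \in V_{ik} \text{ and } C \in V_{i+k,\,j-k}\}$ 
14      until $k = j - 1$ 
15    until $i = n - j + 1$ 
16  until $j = n$ 
17  if $\mathcal{S} \in V_{1n}$ then print “parse of” x “successful in G” 
18  return 
19 end 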


Consider the operation of Algorithm 4 in the following simple abstract example. 
Let the grammar G have two terminal and three intermediate symbols: $\mathcal{A} = \{a, b\}$, 
and $\mathcal{I} = \{A, B, C\}$. The root symbol is S, and there are just four production rules: 

$p_1$: $S \rightarrow AB$ OR $BC$ 
$p_2$: $A \rightarrow BA$ OR $a$ 
$p_3$: $B \rightarrow CC$ OR $b$ 
$p_4$: $C \rightarrow AB$ OR $a$ 

Figure 8.13 shows the parse table generated by Algorithm 4 for the input string x 
= “baaba.” Along the bottom are the characters $x_i$ of this string. Lines 2 through 
5 of the algorithm fill in the first ($j = 1$) row with any internal symbols that derive 
the corresponding character in x. The $i = 1$ and $i = 4$ entries of that bottom row are 
filled with B, because of rewrite rule $p_3$: $B \rightarrow b$. Likewise the remaining entries are filled 
with both A and C, as a result of rewrite rules $p_2$ and $p_4$. 

The core computation in the algorithm is performed in line 13, which fills entries 
throughout the table with symbols that could produce segments in lower rows, and 
hence might be part of a valid derivation (if indeed one is found). For instance, 
the $i = 1, j = 2$ entry contains any symbols that could produce segments in the 
row beneath it. Thus this entry contains S because of rule $p_1$: $S \rightarrow BC$, and also 
contains A because of rule $p_2$: $A \rightarrow BA$. According to the innermost loop over $k$ 
(lines 12 to 14), we seek the left hand side for rules that span a range. For instance, 
the $i = 3, j = 3$ entry contains B because of $k = 2$ and rule $p_3$: $B \rightarrow CC$ (as shown 
in Fig. 8.14). 

[Figure 8.13 graphic: parse table for the target string x. Rows $j = 1, \ldots, 5$ hold the symbols that derive substrings of length 1 through 5; columns $i = 1, \ldots, 5$ index the characters of x along the bottom.] 


Figure 8.13: The bottom-up parsing algorithm fills the parse table with symbols that 
might be part of a valid derivation. The pink lines are not provided by the algorithm, 
but when read downward from the root symbol confirm that a valid derivation exists. 


Figure 8.14 shows the cells that are searched when filling a particular cell in the 
parse table. The sequence sweeps vertically up to the cell in question, while diagonally 
down from the cell in question; this guarantees that all paths from the top cell 
in a valid derivation can be found. If the top cell contains the root symbol S (and 
possibly other symbols), then indeed the string is successfully parsed. That is, there 
exists a valid derivation leading from S to the target string x. 

To understand how this table is filled, consider first the $j = 1$ row. The $j = 1, i = 4$ 
cell contains B, because according to rewrite rule $p_3$, B is the only intermediate 
symbol that could yield b in the query sentence, directly below. The same logic holds 
for the $i = 1, j = 1$ cell. The remaining three cells for $j = 1$ contain A and C, 
since these are the only intermediate variables that can derive a. Incidentally, the 
derivation in Fig. 8.15 confirms that the parse is valid. 

The computational complexity of bottom-up parsing performed by Algorithm 4 is 
high. The innermost loop of line 13 is executed n or fewer times, while lines 7 & 9 
are $O(n^2)$, which is also the space complexity. The time complexity is $O(n^3)$. 
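
The three nested loops of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the dictionary encoding of the productions (one character per symbol, upper case for intermediate symbols) and the function name `cyk` are assumptions made here. The table is indexed as in the algorithm, with `table[(i, j)]` holding the symbols that derive the substring of length j starting at character i.

```python
from itertools import product


def cyk(productions, root, x):
    """Return (accepted, table) for string x under a grammar in Chomsky normal form.

    productions maps each intermediate symbol to a list of right-hand sides:
    a 1-character string for A -> z, or a 2-character string for A -> BC.
    """
    n = len(x)
    if n == 0:
        return False, {}
    table = {}
    # Lines 2-5: bottom row, symbols that derive each single character.
    for i in range(1, n + 1):
        table[(i, 1)] = {a for a, rhss in productions.items() if x[i - 1] in rhss}
    # Lines 6-16: longer substrings, built from two shorter adjacent ones.
    for j in range(2, n + 1):              # substring length
        for i in range(1, n - j + 2):      # starting position
            cell = set()
            for k in range(1, j):          # length of the left part
                for b, c in product(table[(i, k)], table[(i + k, j - k)]):
                    cell |= {a for a, rhss in productions.items() if b + c in rhss}
            table[(i, j)] = cell
    # Line 17: the parse succeeds if the root symbol is in the top cell.
    return root in table[(1, n)], table


# Grammar of the worked example: rules p1 to p4.
P = {"S": ["AB", "BC"], "A": ["BA", "a"], "B": ["CC", "b"], "C": ["AB", "a"]}
accepted, table = cyk(P, "S", "baaba")
print(accepted, sorted(table[(1, 5)]))
```

The loops over j, i and k each run at most n times, which is where the cubic time bound comes from, and the table has one cell per (i, j) pair, which gives the quadratic space bound.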


Top-down and other methods of parsing 


As its name suggests, top-down parsing starts with the root node and successively 
applies productions from P, with the goal of finding a derivation of the test sentence 
x. Since it is rare that the sentence is derived in a single production, it is necessary to 
specify some criteria to guide the choice of which rewrite rule to apply. Such criteria 
could include beginning the parse at the first (left) character in the sentence (that 
is, finding a small set of rewrite rules that yield the first character), then iteratively 
expanding the production to derive subsequent characters. 
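
One way to make such a guiding criterion concrete is sketched below. It is only one of many possible top-down strategies and is not taken from the text: the search expands the leftmost intermediate symbol breadth-first, and discards any sentential form whose leading terminal characters disagree with the sentence or which is already longer than the sentence. The function name, the one-character symbol encoding (upper case for intermediate symbols), and the assumption that no production yields the empty string are all choices made for this illustration.

```python
from collections import deque


def top_down_parse(productions, root, x):
    """Search for a leftmost derivation of x.

    Returns the list of sentential forms from root to x, or None if x
    cannot be derived. Assumes no production yields the empty string,
    so a sentential form never shrinks.
    """
    queue = deque([(root, [root])])
    seen = {root}
    while queue:
        form, history = queue.popleft()
        if form == x:
            return history
        # Locate the leftmost intermediate symbol.
        pos = next((p for p, ch in enumerate(form) if ch.isupper()), None)
        if pos is None:
            continue  # all terminals, but not the target sentence
        for rhs in productions.get(form[pos], []):
            new = form[:pos] + rhs + form[pos + 1:]
            # Guide the search: leading terminals must match x, and the
            # form must not already be longer than x.
            lead = next((p for p, ch in enumerate(new) if ch.isupper()), len(new))
            if len(new) > len(x) or new[:lead] != x[:lead]:
                continue
            if new not in seen:
                seen.add(new)
                queue.append((new, history + [new]))
    return None


# The four-rule grammar used in the bottom-up example.
P = {"S": ["AB", "BC"], "A": ["BA", "a"], "B": ["CC", "b"], "C": ["AB", "a"]}
derivation = top_down_parse(P, "S", "baaba")
print(" => ".join(derivation))
```

The prefix test plays the role of "beginning the parse at the first character": a candidate rewrite is kept only if the characters it has fixed so far agree with the start of the sentence.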




Figure 8.14: The innermost loop of Algorithm 4 seeks to fill a cell $V_{ij}$ (outlined in 
red) by the left-hand side of any rewrite rule whose right-hand side corresponds to 
symbols in the two shaded cells. As k is incremented, the cells queried move vertically 
upward to the cell in question, and diagonally down from that cell. The shaded cells 
show the possible right-hand sides in a derivation, as illustrated by the pink lines in 
Fig. 8.13. 


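[Figure 8.15 graphic: derivation tree. $S \rightarrow AB$; the left $A \rightarrow BA$ yields “ba”; the right $B \rightarrow CC$, where the first $C \rightarrow AB$ yields “ab” and the second $C \rightarrow a$.] 


Figure 8.15: This valid derivation of “baaba” in G can be read from the pink lines in 
the parse table of Fig. 8.13 generated by the bottom-up parse algorithm. 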


The bottom-up and top-down parsers just described are quite general and there 
are a number of parsing algorithms which differ in space and time complexities. Many 
parsing methods depend upon the model underlying the grammar. One popular such 
model is finite state machines. Such a machine consists of nodes and transition links; 
each node can emit a symbol, as shown in Fig. 8.16. 
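
As a rough illustration of how a finite state machine recognizes a sentence, the sketch below encodes type 3 rewrite rules as transitions and follows them word by word. Only the rules for S and A are quoted from the caption of Fig. 8.16; treating B as a final node, and the names `RULES`, `FINAL` and `accepts`, are assumptions made solely to keep the example closed and runnable.

```python
# Type 3 rules written as node -> [(emitted word, next node)].
# S -> the A and A -> mouse B OR cow B come from the caption of Fig. 8.16.
# Treating B as a final node is an assumption made for this sketch.
RULES = {
    "S": [("the", "A")],
    "A": [("mouse", "B"), ("cow", "B")],
}
FINAL = {"B"}


def accepts(words, start="S"):
    """Return True if the words can be emitted along a path from start to a final node."""
    nodes = {start}
    for word in words:
        # Follow every transition that emits the current word.
        nodes = {nxt for node in nodes
                 for emitted, nxt in RULES.get(node, [])
                 if emitted == word}
        if not nodes:
            return False
    return bool(nodes & FINAL)


print(accepts(["the", "mouse"]))
print(accepts(["mouse", "the"]))
```

Each word is examined once, so recognition takes time linear in the length of the sentence.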


8.7 Grammatical inference 


In many applications, the grammar is designed by hand. Nevertheless, learning plays 
an extremely important role in pattern recognition research and it is natural that 
we attempt to learn a grammar from example sentences it generates. When seeking 
to follow that general approach we are immediately struck by differences between 
the areas addressed by grammatical methods and those that can be described as 
statistical. First, for most languages there are many (often an infinite number 
of) grammars that can produce it. If two grammars $G_1$ and $G_2$ generate the 
same language (and no other sentences), then this ambiguity is of no consequence; 
recognition will be the same. However, since training is always based on a finite set 
of samples, the problem is underspecified. There are an infinite number of grammars 
consistent with the training data, and thus we cannot recover the source grammar 
uniquely. 


Figure 8.16: One type of finite state machine consists of nodes that can emit terminal 
symbols (“the,” “mouse,” etc.) and transition to another node. Such operation can be 
described by a grammar. For instance, the rewrite rules for this finite state machine 
include $S \rightarrow \text{the}A$, $A \rightarrow \text{mouse}B$ OR $\text{cow}B$, and so on. Clearly these rules imply this 
finite state machine implements a type 3 grammar. The final internal node (shaded) 
would lead to the null symbol $\epsilon$. 


There are two main techniques used to make the problem of inferring a grammar 
from instances tractable. The first is to use both positive and negative instances. That 
is, we use a set $\mathcal{D}^+$ of sentences known to be derivable in the grammar; we also use 
a set $\mathcal{D}^-$ that are known to be not derivable in the grammar. In a multicategory 
case, it is common to take the positive instances in $G_i$ and use them for negative 
examples in $G_j$ for $j \neq i$. Even with both positive and negative instances, a finite 
training set rarely specifies the grammar uniquely. Thus our second technique is to 
impose conditions and constraints. A trivial illustration is that we demand that the 
alphabet of the candidate grammar contain only those symbols that appear in the 
training sentences. Moreover, we demand that every production rule in the grammar 
be used. We seek the “simplest” grammar that explains the training instances where 
“simple” generally refers to the total number of rewrite rules, or the sum of their 
lengths, or other natural criterion. These are versions of Occam’s razor, that the 
simplest explanation of the data is to be preferred (Chap ??). 

In broad overview, learning proceeds as follows. An initial grammar $G^0$ is guessed. 
Often it is useful to specify the type of grammar (1, 2 or 3), and thus place constraints 
on the forms of the candidate rewrite rules. In the absence of other prior information, 
it is traditional to make $G^0$ as simple as possible and gradually expand the set of 
productions as needed. Positive training sentences $x_i^+$ are selected from $\mathcal{D}^+$ one by 
one. If $x_i^+$ cannot be parsed by the grammar, then new rewrite rules are proposed 
for P. A new rule is accepted if and only if it is used for a successful parse of $x_i^+$ and 
does not allow any negative samples to be parsed. 

In greater detail, then, an algorithm for inferring the grammar is: 


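Algorithm 5 (Grammatical inference (overview)) 


1 begin initialize $\mathcal{D}^+$, $\mathcal{D}^-$, $G^0$ 
2   $n^+ \leftarrow |\mathcal{D}^+|$ (number of instances in $\mathcal{D}^+$) 
3   $\mathcal{S} \leftarrow S$ 
4   $\mathcal{A} \leftarrow$ set of characters in $\mathcal{D}^+$ 
5   $i \leftarrow 0$ 
6   do $i \leftarrow i + 1$ 
7     read $x_i^+$ from $\mathcal{D}^+$ 
8     if $x_i^+$ cannot be parsed by $G$ 
9       then do propose additional productions to $\mathcal{P}$ and variables to $\mathcal{I}$ 
10        accept updates if $G$ parses $x_i^+$ but no string in $\mathcal{D}^-$ 
11  until $i = n^+$ 
12  eliminate redundant productions 
13 return $G \leftarrow \{\mathcal{A}, \mathcal{I}, \mathcal{S}, \mathcal{P}\}$ 
14 end 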


Informally, Algorithm 5 continually adds new rewrite rules as required by the 
successive sentences selected from $\mathcal{D}^+$ so long as the candidate rewrite rule does not 
allow a sentence in $\mathcal{D}^-$ to be parsed. Line 9 does not state how to choose the specific 
candidate rewrite rule, but in practice the rule may be chosen from a predefined set 
(with simpler rules selected first), or based on specific knowledge of the underlying 
models generating the sentences. 


Example 4: Grammar inference 


Consider inferring a grammar G from the following positive and negative examples: 
$\mathcal{D}^+ = \{a, aaa, aaab, aab\}$, and $\mathcal{D}^- = \{ab, abc, abb, aabb\}$. Clearly the alphabet of G 
is $\mathcal{A} = \{a, b\}$. We posit a single internal symbol for $G^0$, and the simplest rewrite rule 
$\mathcal{P} = \{S \rightarrow A\}$. 


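 i | $x_i^+$ | $\mathcal{P}$                                                                   | $\mathcal{P}$ produces $\mathcal{D}^-$? 
---+---------+---------------------------------------------------------------------------------+------------------------------- 
 1 | a       | $S \rightarrow A$, $A \rightarrow a$                                            | No 
 2 | aaa     | $S \rightarrow A$, $A \rightarrow a$, $A \rightarrow aA$                        | No 
 3 | aaab    | $S \rightarrow A$, $A \rightarrow a$, $A \rightarrow aA$, $A \rightarrow ab$    | Yes: $ab \in \mathcal{D}^-$ 
 3 | aaab    | $S \rightarrow A$, $A \rightarrow a$, $A \rightarrow aA$, $A \rightarrow aab$   | No 
 4 | aab     | $S \rightarrow A$, $A \rightarrow a$, $A \rightarrow aA$, $A \rightarrow aab$   | No 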


The table shows the progress of the algorithm. The first positive instance, a, 
demands a rewrite rule $A \rightarrow a$. This rule does not allow any sentences in $\mathcal{D}^-$ to be 
derived, and thus is accepted for P. When $i = 3$, the proposed rule $A \rightarrow ab$ indeed 
allows $x_3^+$ to be derived, but the rule is rejected because it also derives a sentence 
in $\mathcal{D}^-$. Instead, the next proposed rule, $A \rightarrow aab$, is accepted. The final grammar 
inferred has the four rewrite rules shown at the bottom of the table. 
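
The steps of this example can be reproduced with a short Python sketch. It is a simplified reading of Algorithm 5 under several assumptions that are not in the text: rules are (left, right) string pairs with upper-case intermediate symbols, candidate rules come from a fixed list ordered from simple to complex (the list below simply contains the rules that appear in the table), only one rule is proposed at a time, and derivability is checked by a bounded breadth-first search that relies on no rule producing the empty string.

```python
from collections import deque


def derives(rules, root, x):
    """Return True if x can be derived from root using rules (no empty productions)."""
    queue, seen = deque([root]), {root}
    while queue:
        form = queue.popleft()
        if form == x:
            return True
        # Expand the leftmost intermediate (upper-case) symbol.
        pos = next((p for p, ch in enumerate(form) if ch.isupper()), None)
        if pos is None:
            continue
        for left, right in rules:
            if left != form[pos]:
                continue
            new = form[:pos] + right + form[pos + 1:]
            if len(new) <= len(x) and new not in seen:
                seen.add(new)
                queue.append(new)
    return False


def infer(positives, negatives, initial, candidates, root="S"):
    """Add candidate rules as needed to parse each positive sentence."""
    rules = list(initial)
    for x in positives:
        if derives(rules, root, x):
            continue
        for rule in candidates:
            if rule in rules:
                continue
            trial = rules + [rule]
            # Accept only if x is parsed and no negative sentence is.
            if derives(trial, root, x) and not any(
                    derives(trial, root, neg) for neg in negatives):
                rules = trial
                break
    return rules


D_plus = ["a", "aaa", "aaab", "aab"]
D_minus = ["ab", "abc", "abb", "aabb"]
# Hypothetical predefined candidate set, simplest rules first.
candidates = [("A", "a"), ("A", "aA"), ("A", "ab"), ("A", "aab")]
for left, right in infer(D_plus, D_minus, [("S", "A")], candidates):
    print(left, "->", right)
```

Under these assumptions the candidate $A \rightarrow ab$ is tried for the third sentence and rejected, because it would let the negative sentence ab be derived, exactly as in the table.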


The method of grammatical inference just described is quite general. It is made 
more specialized by placing restrictions on the types of candidate rewrite rules, cor- 
responding to the designer’s assumptions about the type of grammar (1, 2 or 3). For 
a type 3 grammar, we can consider learning in terms of the finite state machine. In 
that case, learning consists of adding nodes and links (cf. Bibliography). 


8.8 *Rule-based methods 


In problems where classes can be characterized by general relationships among entities, 
rather than by instances per se, it becomes attractive to build classifiers based on rules. 
Rule-based methods are integral to expert systems in artificial intelligence, but since 
they have found only modest use in pattern recognition, we shall give merely a short 
overview. We shall focus on a broad class of if-then rules for representing and learning 
such relationships. 

A very simple if-then rule is 


IF Swims(x) AND HasScales(x) THEN Fish(x), 


which means, of course, that if an object x has the property that it swims, and the 
property that it has scales, then it is a fish. Rules have the great benefits that they 
are easily interpreted and can be used in database applications where information is 
encoded in relations. A drawback is that there is no natural notion of probability and 
it is somewhat difficult, therefore, to use rules when there is high noise and a large 
Bayes error. 

A predicate, such as Man(·), HasTeeth(·) and AreMarried(·,·), is a test that 
returns a value of logical True or False.* Such predicates can apply to problems where 
the data are numerical, non-numerical, linguistic, strings, or any of a broad class of 
types. The choice of predicates and their evaluation depend strongly on the problem, 
of course, and in practice these are generally more difficult tasks than learning the 
rules. For instance, Fig. 8.17 below illustrates the use of rules in categorizing a 
structure as an arch. Such a rule might involve predicates such as Touch(·,·) or 
Supports(·,·,·) which address whether two blocks touch, or whether two blocks 
support a third. It is a very difficult problem in computer vision to evaluate such 
predicates based on a pixel image taken of the scene. 

There are two main types of if-then rules: propositional (variable-free) and first- 
order. A propositional rule describes a particular instance, as in 


IF Male(Bill) AND IsMarried(Bill) THEN IsHusband(Bill), 


where Bill is a particular atomic item. Because its properties are fixed, Bill is an 
example of a (logical) constant. The deficiency of propositional logic is that it provides 
no general way to represent general relations among a large number of instances. For 
example, even if we knew Male(Edward) and IsMarried(Edward) are both True, the 
above rule would not allow us to infer that Edward is a husband, since that rule 
applies only to the particular constant Bill. 

This deficiency is overcome in first-order logic, which permits rules with variables, 
such as 


IF Eats(x,y) AND HasFlesh(x) THEN Carnivore(y), 


where here x and y are the variables. This rule states that for any items x and y, if 
y eats x and x has flesh, then y is a carnivore. Clearly this is a very powerful summary 
of an enormous wealth of examples; first-order rules are far more expressive 
than classical propositional logic. 

* We shall ignore cases where a predicate is Undefined. 

The power of first-order logic is illustrated in the following rules: 


IF Male(x) AND IsMarried(x,z) THEN IsHusband(x), 

IF Parent(x,y) AND Parent(y,z) THEN GrandParent(x, z) 
and 

IF Spouse(x, y) THEN Spouse(y, x). 


A rule from first-order logic can also apply to constants, for instance: 


IF Eats(Mouse, Cat) AND Mammal(Mouse) THEN Carnivore(Cat), 


where Cat and Mouse are two particular constants. 
If-then rules can also incorporate functions, which return numerical values, as 
illustrated in the following: 


IF Male(x) AND (Age(x) < 16) THEN Boy(x), 


where Age(x) is a function that returns a numerical age in years while the expression 
or term (Age(x) < 16) returns either logical True or False. In sum, the above rule 
states that a male younger than 16 years old is a boy. If we were to use decision trees 
or statistical techniques, we would not be able to learn this rule perfectly, even given 
a tremendously large number of examples. 
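
To make the role of variables and functions concrete, the following Python sketch evaluates two of the rules above over a small set of constants. The facts (names, parent relations, ages) are invented purely for illustration, and representing predicates as sets and the function Age as a dictionary is an assumption of this sketch rather than anything prescribed by the text.

```python
from itertools import product

# Hypothetical facts, invented for illustration only.
parent = {("Ann", "Bob"), ("Bob", "Cal"), ("Bob", "Dee")}  # Parent(x, y) is True
male = {"Bob", "Cal"}                                       # Male(x) is True
age = {"Ann": 70, "Bob": 41, "Cal": 12, "Dee": 9}           # Age(x) in years
constants = sorted(age)


def grandparents():
    # IF Parent(x,y) AND Parent(y,z) THEN GrandParent(x,z)
    # The variables x, y, z range over every constant.
    return {(x, z) for x, y, z in product(constants, repeat=3)
            if (x, y) in parent and (y, z) in parent}


def is_boy(x):
    # IF Male(x) AND (Age(x) < 16) THEN Boy(x)
    return x in male and age[x] < 16


print(sorted(grandparents()))
print([x for x in constants if is_boy(x)])
```

A single first-order rule is applied to every binding of its variables, whereas a propositional rule would have to be written out separately for each constant.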

It is clear given a set of first-order rules how to use them in pattern classification: 
we merely present the unknown item and evaluate the propositions and rules. Thus 
consider the long rule 


IF IsBlock(x) AND IsBlock(y) AND IsBlock(z) 
AND Touch(x, y) AND Touch(x, z) AND NotTouch(y, z) (11) 
AND Supports(x,y,z) THEN Arch(x, y,z), 


where Supports(x,y,z) means that x is supported by both y and z. We stress that 
designing algorithms to implement IsBlock(-), Supports(-,-,-) and so on can be 
extremely difficult; there is little we can say about them here other than that nearly 
always building these component algorithms represents the greatest effort in designing 
the overall classifier. Nevertheless, given reliable algorithms of this kind, the rule could be 
used to classify simple structures as an arch or non-arch (Fig. 8.17). 
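
Given the outputs of such component algorithms, evaluating Eq. 11 itself is straightforward, as the sketch below suggests. The function name `is_arch`, the block names, and the encoding of the relations as sets are all assumptions made here; in particular the sketch presumes that some upstream stage has already decided which items are blocks, which pairs touch, and which triples satisfy Supports.

```python
def is_arch(x, y, z, blocks, touches, supports):
    """Evaluate the rule of Eq. 11 for one ordered triple (x, y, z).

    blocks   : set of items for which IsBlock is True
    touches  : set of unordered pairs (frozensets) for which Touch is True
    supports : set of (x, y, z) triples meaning x is supported by both y and z
    """
    def touch(a, b):
        return frozenset((a, b)) in touches

    return (x in blocks and y in blocks and z in blocks
            and touch(x, y) and touch(x, z) and not touch(y, z)
            and (x, y, z) in supports)


# Hypothetical scene: a lintel resting on two separated uprights.
blocks = {"lintel", "left", "right"}
touches = {frozenset(("lintel", "left")), frozenset(("lintel", "right"))}
supports = {("lintel", "left", "right")}
print(is_arch("lintel", "left", "right", blocks, touches, supports))

# Same scene, but the two uprights touch each other: NotTouch(y, z) fails.
touching = touches | {frozenset(("left", "right"))}
print(is_arch("lintel", "left", "right", blocks, touching, supports))
```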


8.8.1 Learning rules 


Now we turn briefly to the learning of such if-then rules. We have already seen several 
ways to learn rules. For instance, we can train a decision tree via CART, ID3, C4.5 
or other algorithm, and then simplify the tree to extract rules (Sect. 8.4). For cases 
where the underlying data arises from a grammar, we can infer the particular rules via 
the methods in Sect. 8.7. A key distinction of the approach we now discuss is that it 
can learn sets of first-order rules containing variables. As in grammatical inference, 
our approach to learning rules from a set of positive and negative examples, $\mathcal{D}^+$ and 
$\mathcal{D}^-$, is to learn a single rule, delete the examples that it explains, and iterate. Such 
sequential covering learning algorithms lead to a disjunctive set of rules that “cover” 
the training data. After such training it is traditional to simplify the resulting logical 
rule by means of standard logical methods. 


Figure 8.17: The rule in Eq. 11 identifies the figure on the left as an example of Arch, 
but not the other two figures. In practice, it is very difficult to develop subsystems that 
evaluate the propositions themselves, for instance Touch(x,y) and Supports(x,y,z). 

The designer must specify the predicates and functions, based on prior knowledge 
of the problem domain. The algorithm begins by considering the most general rules 
using these predicates and functions, and finds the “best” simple rule. Here, “best” 
means that the rule describes the largest number of training examples. Then, the 
algorithm searches among all refinements of the best rule, choosing the refinement 
that too is “best.” This process is iterated until no more refinements can be added, 
or when the number of items described is maximum. In this way a single, though 
possibly complex, if-then rule has been learned (Fig. 8.18). The sequential covering 
algorithm iterates this process and returns a set of rules. Because of its greedy nature, 
the algorithm need not learn the smallest set of rules. 


[Figure 8.18 graphic: search tree of candidate rules, each level refining the best rule of the level above.] 

Root: 
IF (no condition) THEN Fish(x)=T 

First level: 
IF HasHair(x) THEN Fish(x)=F 
IF (Width(x)>2m) THEN Fish(x)=F 
IF Swims(x) THEN Fish(x)=T 
IF Runs(x) THEN Fish(x)=F 
IF HasEyes(x) ... 

Second level (refinements of IF Swims(x)): 
IF Swims(x) AND LaysEggs(x) THEN Fish(x)=T 
IF Swims(x) AND Runs(x) THEN Fish(x)=F 
IF Swims(x) AND HasHair(x) THEN Fish(x)=F 
IF Swims(x) AND HasScales(x) THEN Fish(x)=T 
IF Swims(x) AND (Weight(x)>9kg) THEN Fish(x)=F 

Third level (refinements of IF Swims(x) AND HasScales(x)): 
IF Swims(x) AND HasScales(x) AND HasGills(x) THEN Fish(x)=T 
IF Swims(x) AND HasScales(x) AND HasEyes(x) THEN Fish(x)=T 
IF Swims(x) AND HasScales(x) AND (Length(x)>5m) THEN Fish(x)=F 


Figure 8.18: In sequential covering, candidate rules are searched through successive 
refinements. First, the “best” rule having a single conditional predicate is found, i.e., 
the one explaining most training data. Next, other candidate predicates are added, 
the best compound rule selected, and so forth. 


A general approach is to search first through all rules having a single attribute. 
Next, consider rules having a single conjunction of two predicates, then conjunctions 
of three, and so on. Note that this greedy algorithm need not be optimal; that 
is, it need not yield the most compact rule. 
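
The greedy search and the outer covering loop can be sketched in Python as follows. This is a propositional simplification intended only to show the control flow: items are dictionaries of Boolean attributes, a rule is a conjunction of attribute names, and the scoring criterion (positives covered minus negatives covered) is an assumption chosen here, since "best" can be defined in several ways. The animals and attribute values are invented for illustration.

```python
def covers(rule, item):
    """A rule is a tuple of predicate names, read as a conjunction."""
    return all(item[p] for p in rule)


def score(rule, positives, negatives):
    # Assumed criterion: positives covered minus negatives covered.
    return (sum(covers(rule, e) for e in positives)
            - sum(covers(rule, e) for e in negatives))


def learn_one_rule(predicates, positives, negatives):
    """Greedily refine the most general rule, one predicate at a time."""
    rule = ()
    while any(covers(rule, e) for e in negatives):
        refinements = [rule + (p,) for p in predicates if p not in rule]
        if not refinements:
            break
        best = max(refinements, key=lambda r: score(r, positives, negatives))
        if score(best, positives, negatives) <= score(rule, positives, negatives):
            break  # no refinement improves the current rule
        rule = best
    return rule


def sequential_covering(predicates, positives, negatives):
    """Learn rules one at a time, deleting the positives each rule explains."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(predicates, remaining, negatives)
        covered = [e for e in remaining if covers(rule, e)]
        if not covered or any(covers(rule, e) for e in negatives):
            break  # stop rather than accept a useless or inconsistent rule
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules


# Hypothetical training data for the category Fish.
predicates = ["Swims", "HasScales", "HasHair"]
fish = [{"Swims": True, "HasScales": True, "HasHair": False},
        {"Swims": True, "HasScales": True, "HasHair": False}]
not_fish = [{"Swims": True, "HasScales": False, "HasHair": True},
            {"Swims": False, "HasScales": True, "HasHair": False},
            {"Swims": False, "HasScales": False, "HasHair": True}]
rules = sequential_covering(predicates, fish, not_fish)
print(rules)
```

Because each refinement is chosen greedily and never revisited, the returned set of rules is consistent with this training data but is not guaranteed to be the smallest such set.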


Summary 


Non-metric data consists of lists of nominal attributes; such lists might be unordered 
or ordered (strings). Tree-based methods such as CART, ID3 and C4.5 rely on answers 
to a series of questions (typically binary) for classification. The designer selects the 
form of question and the tree is grown, starting at the root node, by finding splits 
of data that make the representation more “pure.” There are several acceptable 
impurity measures, such as misclassification, variance and Gini; the entropy impurity, 
however, has found greatest use. To avoid overfitting and to improve generalization, 
one must either employ stopped splitting (declaring a node with non-zero impurity to 
be a leaf), or instead prune a tree trained to minimum impurity leaves. Tree classifiers 
are very flexible, and can be used in a wide range of applications, including those with 
data that is metric, non-metric or in combination. 

When comparing patterns that consist of strings of non-numeric symbols, we use 
edit distance — a measure of the number of fundamental operations (insertions, dele- 
tions, exchanges) that must be performed to transform one string into another. While 
the general edit distance is not a true metric, edit distance can nevertheless be used 
for nearest-neighbor classification. String matching is finding whether a test string 
appears in a longer text. The requirement of a perfect match in basic string matching 
can be relaxed, as in string matching with errors, or with the don’t care symbol. While 
these basic string and pattern recognition ideas are simple and straightforward, addressing 
them in large problems requires algorithms that are computationally efficient. 

Grammatical methods assume the strings are generated from certain classes of 
rules, which can be described by an underlying grammar. A grammar G consists of 
an alphabet, intermediate symbols, a starting or root symbol and most importantly 
a set of rewrite rules, or productions. The four different types of grammars — free, 
context-sensitive, context-free, and regular — make different assumptions about the 
nature of the transformations. Parsing is the task of accepting a string x and deter- 
mining whether it is a member of the language generated by G, and if so, finding a 
derivation. Grammatical methods find greatest use in highly structured environments, 
particularly where structure lies at many levels. Grammatical inference generally uses 
positive and negative example strings (i.e., ones in the language generated by G and 
ones not in that language), to infer a set of productions. 

Rule-based systems use either propositional logic (variable-free) or first-order logic 
to describe a category. In broad overview, rules can be learned by sequentially “cov- 
ering” elements in the training set by successively more complex compound rules. 


Bibliographical and Historical Remarks 


Most work on decision trees addresses problems in continuous features, though a 
key property of the method is that it applies to nominal data too. Some of the 
foundations of tree-based classifiers stem from the Concept Learning System described 
in [42], but the important book on CART [10] provided a strong statistics foundation 
and revived interest in the approach. Quinlan has been a leading exponent of tree 
classifiers, introducing ID3 [66], C4.5 [69], as well as the application of minimum 
description length for pruning [71, 56]. A good overview is [61], and a comparison 
of multivariate decision tree methods is given in [11]. Splitting and pruning criteria 
based on probabilities are explored in [53], and the use of an interesting information 
metric for this end is described in [52]. The Gini index was first used in analysis 
of variance in categorical data [47]. Incremental or on-line learning in decision trees 
is explored in [85]. The missing variable problem in trees is addressed in [10, 67], 
which describe methods more general than those presented here. An unusual parallel 
“neural” search through trees was presented in [78]. 

The use of edit distance began in the 1970s [64]; a key paper by Wagner and Fischer 
proposed the fundamental Algorithm 3 and showed that it was optimal [88]. The 
explosion of digital information, especially natural language text, has motivated work 
on string matching and related operations. An excellent survey is [5] and two thorough 
books are [23, 82]. The computational complexity of string algorithms is presented 
in [21, Chapter 34]. The fast string matching method of Algorithm 2 was introduced 
in [9]; its complexity and speedups and improvements were discussed in [18, 35, 24, 
4, 40, 83]. String edit distance that permits block-level transpositions is discussed 
in [48]. Some sophisticated string operations (two-dimensional string matching, 
longest common subsequence and graph matching) have found only occasional 
use in pattern recognition. Statistical methods applied to strings are discussed in 
[26]. Finite-state automata have been applied to several problems in string matching 
[23, Chapter 7], as well as time series prediction and switching, for instance converting 
from an alphanumeric representation to a binary representation [43]. String matching 
has been applied to the recognition of DNA sequences and text, and is essential in most 
pattern recognition and template matching involving large databases of text [14]. 
There is a growing literature on special purpose hardware for string operations, of 
which the Splash-2 system [12] is a leading example. 

The foundations of a formal study of grammar, including the classification of 
grammars, began with the landmark book by Chomsky [16]. An early exposition 
of grammatical inference [39, Chapter 6] was the source for much of the discussion 
here. Recognition based on parsing (Latin pars orationis or “part of speech”) has 
been fundamental in automatic language recognition. Some of the earliest work on 
three-dimensional object recognition relied on complex grammars which described 
the relationships of corners and edges, in block structures such as arches and towers. 
It was found that such systems were very brittle; they failed whenever there were 
errors in feature extraction, due to occlusion and even minor misspecifications of the 
model. For the most part, then, grammatical methods have been abandoned for object 
recognition and scene analysis [60, 25]. Grammatical methods have been applied to 
the recognition of some simple, highly structured diagrams, such as electrical circuits, 
simple maps and even Chinese/Japanese characters. For useful surveys of the basic 
ideas in syntactic pattern recognition see [33, 34, 32, 13, 62, 14], for parsing see [28, 3], 
for grammatical inference see [59]. The complexity of parsing type 3 is linear in the 
length of the string, type 2 is low-order polynomial, type 1 is exponential; pointers to 
the relevant literature appear in [76]. There has been a great deal of work on parsing 
natural language and speech, and a good textbook on artificial intelligence addressing 
this topic and much more is [75]. There is much work on inferring grammars from 
instances, such as the Crespi-Reghizzi algorithm (context free) [22]. If queries can be 
presented interactively, the learning of a grammar can be speeded [81]. 

The methods described in this chapter have been expanded to allow for stochastic 
grammars, where there are probabilities associated with rules [20]. A grammar can 
be considered a specification of a prior probability for a class; for instance, a uniform 
prior over all (legal) strings in the language $\mathcal{L}$. Error-correcting parsers have been 
used when random variations arise in an underlying stochastic grammar [50, 84]. One 
can also apply probability measures to languages [8]. 

Rule-based methods have formed the foundation of expert systems, and have been 
applied extensively through many branches of artificial intelligence such as planning, 
navigation and prediction; their use in pattern recognition has been modest, however. 
Early influential systems include DENDRAL, for inferring chemical structure from 
mass spectra [29], PROSPECTOR, for finding mineral deposits [38], and MYCIN, 
for medical diagnosis [79]. Early uses of rule induction for pattern recognition include 
those of Michalski [57, 58]. Figure 8.17 was inspired by Winston’s influential work 
on learning simple geometrical structures and relationships [91]. Learning rules can 
be called inductive logic programming; Clark and Niblett have made a number of 
contributions to the field, particularly their CN2 induction algorithm [17]. Quinlan, 
who has contributed much to the theory and application of tree-based classifiers, 
describes his FOIL algorithm, which uses a minimum description length criterion to 
stop the learning of first-order rules [68]. Texts on inductive logic include [46, 63], and 
texts on general machine learning, including inferencing, include [44, 61]. 


Problems 


Section 8.2 


1. When a test pattern is classified by a decision tree, that pattern is subjected 
to a sequence of queries, corresponding to the nodes along a path from root to leaf. 
Prove that for any decision tree, there is a functionally equivalent tree in which every 
such path consists of distinct queries. That is, given an arbitrary tree prove that 
it is always possible to construct an equivalent tree in which no test pattern is ever 
subjected to the same query twice. 


Section 8.3 


2. Consider classification trees that are non-binary. 


(a) Prove that for any arbitrary tree, with possibly unequal branching ratios through- 
out, there exists a binary tree that implements the same classification function. 


(b) Consider a tree with just two levels — a root node connected to B leaf nodes 
(B > 2). What are the upper and the lower limits on the number of levels in a 
functionally equivalent binary tree, as a function of B? 


(c) As in part (b), what are the upper and lower limits on the number of nodes in 
a functionally equivalent binary tree? 


3. Compare the computational complexities of a monothetic and a polythetic tree 
classifier trained on the same data as follows. Suppose there are n/2 training patterns 
in each of two categories. Every pattern has d attributes, each of which can take on 
k discrete values. Assume that the best split evenly divides the set of patterns. 


(a) How many levels will there be in the monothetic tree? The polythetic tree? 


(b) In terms of the variables given, what is the complexity of finding the optimal 
split at the root of a monothetic tree? A polythetic tree? 


(c) Compare the total complexities for training the two trees fully. 


4. The task here is to find the computational complexity of training a tree classifier 
using the twoing impurity where candidate splits are based on a single feature. Suppose 
there are c classes, $\omega_1, \omega_2, \ldots, \omega_c$, each with n/c patterns that are d-dimensional. 
Proceed as follows: 


(a) How many possible non-trivial divisions into two supercategories are there at 
the root node? 


(b) For any one of these candidate supercategory divisions, what is the computational 
complexity of finding the split that minimizes the entropy impurity? 


(c) Use your results from parts (a) & (b) to find the computational complexity of 
finding the split at the root node. 


(d) Suppose for simplicity that each split divides the patterns into equal subsets 
and furthermore that each leaf node corresponds to a single pattern. In terms 
of the variables given, what will be the expected number of levels of the tree? 


(e) Naturally, the number of classes represented at any particular node will depend 
upon the level in the tree; at the root all c categories must be considered, while at 
the level just above the leaves, only 2 categories must be considered. (The pairs 
of particular classes represented will depend upon the particular node.) State 
some natural simplifying assumptions, and determine the number of candidate 
classes at any node as a function of level. (You may need to use the floor or 
ceiling notation, $\lfloor x \rfloor$ or $\lceil x \rceil$, in your answer, as described in the Appendix.) 


(f) Use your results from part (e) and the number of patterns to find the computational 
complexity at an arbitrary level L. 


(g) Use all your results to find the computational complexity of training the full 
tree classifier. 


(h) Suppose there are $n = 2^{10}$ patterns, each of which is d = 6 dimensional, evenly 
divided among c = 16 categories. Suppose that on a uniprocessor a fundamental 
computation requires roughly $10^{-10}$ seconds. Roughly how long will it take to 
train your classifier using the twoing criterion? How long will it take to classify 
a single test pattern? 


5. Consider training a binary tree using the entropy impurity, and refer to Eqs. 1 & 5. 


(a) Prove that the decrease in entropy impurity provided by a single yes/no query 
can never be greater than one bit. 


(b) For the two trees in Example 1, verify that each split reduces the impurity 
and that this reduction is never greater than 1 bit. Explain nevertheless why 
the impurity at a node can be lower than at its descendant, as occurs in that 
Example. 


(c) Generalize your result from part (a) to the case with arbitrary branching ratio 
B > 2. 

6. Let $P(\omega_1), \ldots, P(\omega_c)$ denote the probabilities of c classes at node N of a binary 
classification tree, and $\sum_{j=1}^{c} P(\omega_j) = 1$. Suppose the impurity $i(P(\omega_1), \ldots, P(\omega_c))$ at 
N is a strictly concave function of the probabilities. That is, for any probabilities 


$$i_a = i(P^a(\omega_1), \ldots, P^a(\omega_c))$$
$$i_b = i(P^b(\omega_1), \ldots, P^b(\omega_c))$$
and 
$$i^*(\alpha) = i(\alpha_1 P^a(\omega_1) + (1 - \alpha_1) P^b(\omega_1), \ldots, \alpha_c P^a(\omega_c) + (1 - \alpha_c) P^b(\omega_c)),$$


then for $0 \le \alpha_j \le 1$ and $\sum_{j=1}^{c} \alpha_j = 1$, we have 


$$i_a \le i^* \le i_b.$$


(a) Prove that for any split, we have $\Delta i(s,t) \ge 0$, with equality if and only if 
$P(\omega_j|T_L) = P(\omega_j|T_R) = P(\omega_j|T)$, for j = 1, ..., c. In other words, for a concave 
impurity function, splitting never increases the impurity. 


(b) Prove that entropy impurity (Eq. 1) is a concave function. 


(c) Prove that Gini impurity (Eq. 3) is a concave function. 
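
As a numerical companion to Problems 5 and 6, the following is a minimal sketch of how an impurity and the impurity drop of a binary split might be computed from class counts. The function names are hypothetical, the entropy is measured in bits, and the Gini impurity is taken here as $1 - \sum_j P^2(\omega_j)$, which may differ from the text's Eq. 3 by a constant factor. Both children are assumed to be non-empty.

```python
import math

def entropy_impurity(probs):
    """Entropy impurity in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini impurity, taken here as 1 - sum_j P_j^2 (an assumed normalization)."""
    return 1.0 - sum(p * p for p in probs)

def class_probs(counts):
    """Convert per-class counts at a node into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def impurity_drop(impurity, left_counts, right_counts):
    """Impurity of the parent minus the weighted impurities of the two children.

    left_counts and right_counts list the number of patterns of each class
    sent to the left and right child; both children must be non-empty.
    """
    n_left, n_right = sum(left_counts), sum(right_counts)
    p_left = n_left / (n_left + n_right)
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    return (impurity(class_probs(parent_counts))
            - p_left * impurity(class_probs(left_counts))
            - (1.0 - p_left) * impurity(class_probs(right_counts)))
```

A perfect split of two equally likely classes removes exactly one bit of entropy impurity, while a split whose children have the same class proportions as the parent removes none, in line with the equality condition of Problem 6(a).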


7. Show that the surrogate split method described in the text corresponds to the 
assumption that the missing feature (attribute) is the one most informative. 

8. Consider a two-category problem and the following training patterns, each having 
four binary attributes: 


$\omega_1$ | $\omega_2$ 

0110 | 1011 
1010 | 0000 
0011 | 0100 
1111 | 1110 


(a) Use the entropy impurity (Eq. 1) to create by hand an unpruned classifier for 
this data. 


(b) Apply simple logical reduction methods to your tree in order to express each 
category by the simplest logical expression, i.e., with the fewest ANDs and ORs. 


9. Show that the time complexity of recall in an unpruned, fully trained tree classifier 
with uniform branching ratio is O(log n) where n is the number of training patterns. 
For uniform branching factor, B, state the exact functional form of the number of 
queries applied to a test pattern as a function of B. 

10. Consider impurity functions for a two-category classification problem as a function 
of $P(\omega_1)$ (and implicitly $P(\omega_2) = 1 - P(\omega_1)$). Show that the simplest reasonable 
polynomial form for the impurity is related to the sample variance as follows: 


(a) Consider restricting impurity functions to the family of polynomials in $P(\omega_1)$. 
Explain why i must be at least quadratic in $P(\omega_1)$. 


(b) Write the simplest quadratic form in $P(\omega_1)$ given the boundary conditions 
$i(P(\omega_1) = 0) = i(P(\omega_1) = 1) = 0$; show that your impurity function can be 
written $i \propto P(\omega_1)P(\omega_2)$. 


(c) Suppose all patterns in category $\omega_1$ are assigned the value 1.0, while all those in 
$\omega_2$ the value 0.0, thereby giving a bimodal distribution. Show that your impurity 
measure is proportional to the sample variance of this full distribution. Interpret 
your answer in words. 


11. Show how general costs, represented in a cost matrix $\lambda_{ij}$, can be incorporated 
into the misclassification impurity (Eq. 4) during the training of a multicategory tree. 

12. In this problem you are asked to create tree classifiers for a one-dimensional two-category 
problem in the limit of large number of points, where $P(\omega_1) = P(\omega_2) = 1/2$, 
$p(x|\omega_1) \sim N(0,1)$ and $p(x|\omega_2) \sim N(1,2)$, and all nodes have decisions of the form “is 
$x < x_s$” for some threshold $x_s$. Each binary tree will be small: just a root node plus 
two other (non-terminal) nodes and four leaf nodes. For each of the four impurity 
measures below, state the splitting criteria (i.e., the value $x_s$ at each of the three 
non-terminal nodes), as well as the final test error. Whenever possible, express your 
answers functionally, possibly using the error function erf(·), as well as numerically. 


(a) Entropy impurity (Eq. 1). 

(b) Gini impurity (Eq. 3). 

(c) Misclassification impurity (Eq. 4). 

(d) Another splitting rule is based on the so-called Kolmogorov-Smirnov test. Let 
the cumulative distributions for a single variable x for each category be $F_i(x)$ 
for i = 1, 2. The splitting criterion is the maximum difference in the cumulative 
distributions, i.e., 

$$\max_{x_s} |F_1(x_s) - F_2(x_s)|.$$

(e) Using the methods of Chap. ??, calculate the Bayes decision rule, and the Bayes 
error. 
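
The Kolmogorov-Smirnov splitting rule mentioned in Problem 12 can be illustrated on finite samples. The sketch below is only an illustration under stated assumptions: it replaces the true cumulative distributions with empirical ones, restricts candidate thresholds to the observed sample values, and uses the hypothetical function name `ks_split`. It does not address the large-sample Gaussian case posed in the problem.

```python
def ks_split(samples_1, samples_2):
    """Return (threshold, gap) maximizing |F1(x) - F2(x)| over observed values.

    F1 and F2 are empirical cumulative distributions, i.e. the fraction of
    each category's samples that are <= x. Ties keep the smallest threshold.
    """
    def ecdf(samples, x):
        return sum(1 for s in samples if s <= x) / len(samples)

    best_x, best_gap = None, -1.0
    # Only observed values can change an empirical CDF, so they suffice.
    for x in sorted(set(samples_1) | set(samples_2)):
        gap = abs(ecdf(samples_1, x) - ecdf(samples_2, x))
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x, best_gap
```

For two well-separated samples, such as `[0, 1, 2]` and `[3, 4, 5]`, the chosen threshold sits at the largest value of the lower group, where the gap between the two empirical distributions reaches 1.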


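13. Repeat Problem 12 but for two one-dimensional Cauchy distributions, 


$$p(x|\omega_i) = \frac{1}{\pi b_i} \cdot \frac{1}{1 + \left(\frac{x - a_i}{b_i}\right)^2}, \quad i = 1, 2,$$


with $P(\omega_1) = P(\omega_2) = 1/2$, $a_1 = 0$, $b_1 = 1$, $a_2 = 1$ and $b_2 = 2$. (Here error functions 
are not needed.) 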

14. Generalize the missing attribute problem to the case of several missing features, 
and to several deficient patterns. Specifically, write pseudocode for creating a binary 
decision tree where each d-dimensional pattern can have multiple missing features. 


15. During the growing of a decision tree, a node represents the following 
six-dimensional binary patterns: 


$\omega_1$ | $\omega_2$ 

110101 | 011100 
101001 | 010100 
100001 | 011010 
101101 | 010000 
010101 | 001000 
111001 | 010100 
100101 | 111000 
011000 | 110101 


Candidate decisions are based on single feature values. 


(a) Which feature should be used for splitting? 


(b) Recall the use of statistical significance for stopped splitting. What is the null 
hypothesis in this example? 


(c) Calculate chi-squared for your decision in part (a). Does it differ significantly 
from the null hypothesis at the 0.01 confidence level? Should splitting be 
stopped? 


(d) Repeat part (c) for the 0.05 level. 


16. Consider the following patterns, each having four binary-valued attributes: 


$\omega_1$ | $\omega_2$ 

1100 | 1100 
0000 | 1111 
1010 | 1110 
0011 | 0111 


Note especially that the first patterns in the two categories are the same. 


(a) Create by hand a binary classification tree for this data. Train your tree so that 
the leaf nodes have the lowest impurity possible. 


(b) Suppose it is known that during testing the prior probabilities of the two 
categories will not be equal, but instead $P(\omega_1) = 2P(\omega_2)$. Modify your training 
method and use the above data to form a tree for this case. 

17. Consider training a binary decision tree to classify two-component patterns from 
two categories. The first component is binary, 0 or 1, while the second component 
has six possible values, A through F: 


$\omega_1$ | $\omega_2$ 
1A | 0A 
0E | 0C 
0B | 1C 
1B | 0F 
1F | 0B 
0D | 1D 


Compare splitting the root node based on the first feature with splitting it on the 
second feature in the following way. 


(a) Use an entropy impurity with a two-way split (i.e., B = 2) on the first feature 
and a six-way split on the second feature. 


(b) Repeat (a) but using a gain ratio impurity. 


(c) In light of your answers above, discuss the value of gain ratio impurity in cases 
where splits have different branching ratios. 
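
To make the comparison in Problem 17 concrete, here is a small sketch of a gain ratio computation for a multiway split. It assumes the gain ratio is the drop in entropy impurity divided by the entropy of the branch proportions, $-\sum_k P_k \log_2 P_k$; the function names and the toy counts in the usage note are hypothetical and are not the data of the problem.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution; zero entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain_ratio(branch_counts):
    """Gain ratio of a B-way split.

    branch_counts[k][j] is the number of patterns of class j sent down
    branch k. Empty branches are ignored. The split is assumed to use at
    least two non-empty branches, so the denominator is positive.
    """
    branches = [b for b in branch_counts if sum(b) > 0]
    n = sum(sum(b) for b in branches)
    parent = [sum(col) / n for col in zip(*branches)]
    weights = [sum(b) / n for b in branches]
    children = sum(w * entropy([c / sum(b) for c in b])
                   for w, b in zip(weights, branches))
    drop = entropy(parent) - children       # ordinary entropy impurity drop
    return drop / entropy(weights)          # penalize high branching ratios
```

With toy counts, a perfect two-way split `[[4, 0], [0, 4]]` has gain ratio 1, while an equally perfect four-way split `[[2, 0], [2, 0], [0, 2], [0, 2]]` has gain ratio 0.5: the impurity drop is the same, but the larger branching ratio is penalized.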

18. Consider strings x and text, of length m and n, respectively, from an alphabet A 
consisting of d characters. Assume that the naive string-matching algorithm 
(Algorithm 1) exits the implied loop in line 4 as soon as a mismatch occurs. Prove that the 
number of character-to-character comparisons made on average for random strings is 


$$\frac{1 - d^{-m}}{1 - d^{-1}}\,(n - m + 1) \le 2(n - m + 1).$$
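
The quantity studied in Problem 18 can be measured directly. The following is a minimal sketch of naive string matching with an early exit on the first mismatch and a comparison counter; it is written from the description above rather than copied from the text's Algorithm 1, the function name is hypothetical, and shifts are reported with 0-based indexing.

```python
def naive_match(x, text):
    """Return (valid_shifts, comparisons) for pattern x in text.

    Each candidate shift s is tested character by character, and the inner
    loop stops at the first mismatch. Every character-to-character test
    increments the counter, whether it succeeds or fails.
    """
    m, n = len(x), len(text)
    valid_shifts, comparisons = [], 0
    for s in range(n - m + 1):
        matched = True
        for j in range(m):
            comparisons += 1
            if x[j] != text[s + j]:
                matched = False
                break               # early exit: this shift is invalid
        if matched:
            valid_shifts.append(s)
    return valid_shifts, comparisons
```

For example, searching for "ab" in "aab" tests two shifts with two comparisons each, finding the single valid shift 1. Averaging the counter over many randomly generated strings gives an empirical value to set against the analytic expression.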


19. Consider string matching using the Boyer-Moore algorithm (Algorithm 2) based 
on the trinary alphabet A = {a,b,c}. Apply the good-suffix function G and the 
last-occurrence function F to each of the following strings: 


(a) “acaccacbac” 

(b) “abababcbcbaaabcbaa” 

(c) “cccaaababaccc” 

(d) “abbabbabbcbbabbcbba” 

20. Consider the string-matching problem illustrated in the top of Fig. 8.8. Assume 
text began at the first character of “probabilities.” 


(a) How many basic character comparisons are required by the naive string-matching 
algorithm (Algorithm 1) to find a valid shift? 


(b) How many basic character comparisons are required by the Boyer-Moore string 
matching algorithm (Algorithm 2)? 


21. For each of the texts below, determine the number of fundamental character 
comparisons needed to find all valid shifts for the test string x = “abcca” using 
the naive string-matching algorithm (Algorithm 1) and the Boyer-Moore algorithm 
(Algorithm 2). 


(a) “abcccdabacabbca” 

(b) “dadadadadadadad” 

(c) “abcbcabcabcabc” 

(d) “accabcababacca” 

(e) “bbccacbccabbcca” 


22. Write pseudocode for an efficient construction of the last-occurrence function F 
used in the Boyer-Moore algorithm (Algorithm 2). Let d be the number of elements 
in the alphabet A, and m the length of string x. 


(a) What is the time complexity of your algorithm in the worst case? 
(b) What is the space complexity of your algorithm in the worst case? 


(c) How many fundamental operations are required to compute F for the 26- 
letter English alphabet for x = “bonbon”? For x = “marmalade”? For x = 
“abcdabdabcaabcda”? 


23. Consider the training data from the trinary alphabet A = {a,b,c} in the table 


$\omega_1$ | $\omega_2$ | $\omega_3$ 
aabbc | bccba | caaaa 
ababcc | bbbca | cbcaab 
babbcc | cbbaaaa | baaca 

Use the simple edit distance to classify each of the strings below. If there are 
ambiguities in the classification, state which two (or all three) categories are candidates. 


(a) “ccab” 
(b) “abdca” 
(c) “abc” 
(d) “bacaca” 


25. Repeat Problem 23 but assume that the costs of different string transformations 
are not equal. In particular, assume that an interchange is twice as costly as an 
insertion or a deletion. 

26. Consider edit distance with positive but otherwise arbitrary costs associated 
with each of the fundamental operations of insertion, deletion and substitution. 


(a) Which of the criteria for a metric are always obeyed and which not necessarily 
obeyed? 


(b) For any criteria that are not always obeyed, construct a counterexample. 
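
For experimenting with Problems 23 through 26, a cost-weighted edit distance is useful. The sketch below uses the standard dynamic-programming recurrence over prefixes, which is not necessarily identical to the text's Algorithm 3. The function name and the cost parameters `ins_cost`, `del_cost` and `sub_cost` are assumptions made for illustration.

```python
def edit_distance(x, y, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of transforming string x into string y.

    C[i][j] holds the cheapest way to turn the first i characters of x
    into the first j characters of y using insertions, deletions and
    substitutions with the given positive costs.
    """
    n1, n2 = len(x), len(y)
    C = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        C[i][0] = i * del_cost            # delete every character of x[:i]
    for j in range(1, n2 + 1):
        C[0][j] = j * ins_cost            # insert every character of y[:j]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            change = 0 if x[i - 1] == y[j - 1] else sub_cost
            C[i][j] = min(C[i - 1][j] + del_cost,      # deletion
                          C[i][j - 1] + ins_cost,      # insertion
                          C[i - 1][j - 1] + change)    # match or substitution
    return C[n1][n2]
```

With unequal insertion and deletion costs, the distance from x to y generally differs from the distance from y to x, which is one way to probe the metric properties asked about in Problem 26.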


27. Algorithm 3 employs a greedy heuristic for computing the edit distance between 
two strings x and y; it need not give a global optimum. In the following, let $|x| = n_1$ 
and $|y| = n_2$. 

(a) State the computational complexity of an exhaustive examination of all 
transformations of x into y. (Assume that no transformation need be considered if 
it leads to a string shorter than $\mathrm{Min}[n_1, n_2]$ or longer than $\mathrm{Max}[n_1, n_2]$.) 


(b) Recall from Chap. ?? the basic approach of linear programming. Write pseu- 
docode that would apply linear programming to the calculation of edit distances. 


28. Consider the general edit distance with positive costs and whether it has the 
four properties of a true metric: non-negativity, reflexivity, symmetry and the triangle 
inequality. 

29. Consider strings x and text, of length m and n, respectively, from an alphabet 
A consisting of d characters. 


(a) Modify the pseudocode of the naive string-matching algorithm to include the 
don’t care symbol. 


(b) Employ the assumptions of Problem 18 but also that x has exactly k don’t 
care symbols while text has none. Find the number of character-to-character 
comparisons made on average for otherwise random strings. 


(c) Show that in the limit of k = 0 your answer is closely related to that of Prob- 
lem 18. 


(d) What is your answer in part (b) in the limit k = m? 


Section 8.6 


30. Mathematical expressions in the computer language Lisp are of the form 
(operation operand_1 operand_2) where spaces delineate potentially ambiguous 
symbols and expressions can be nested, for example (quotient (plus 4 9) 6). 


(a) Write a simple grammar for the four basic arithmetic operations plus, difference, 
times and quotient, applied to positive single-digit integers. Be sure to include 
parentheses in your alphabet. 


(b) Determine by hand whether each of the following candidate Lisp expressions can 
be derived in your grammar, and if so, show a corresponding derivation tree. 


• (times (plus (difference 5 9)(times 3 8)) (quotient 2 6)) 
• (7 difference 2) 
• (quotient (7 plus 2) (plus 6 3)) 
• ((plus) (6 2)) 
• (difference (plus 5 9) (difference 6 8)). 


31. Consider the language $L(G) = \{a^n b \mid n \ge 1\}$. 
(a) Construct by hand a grammar that generates this language. 
(b) Use G to form derivation trees for the strings “ab” and “aaaaab.” 


32. Consider the grammar G: A = {a,b,c}, S = S, T = {A, B} and 
P = {S → cAb, A → aBa, B → aBa, B → cb}. 

(a) What type of grammar is G? 
(b) Prove that this grammar generates the language $L(G) = \{c a^n c b a^n b \mid n \ge 1\}$. 
(c) Draw the derivation tree for the following two strings: “caacbaab” and “cacbab.” 


33. A palindrome is a sequence of characters that reads the same forward and 
backward, such as “i,” “tat,” “boob,” and “sitonapotatopanotis.” 


(a) Write a grammar that generates all palindromes using 26 English letters (no 
spaces). Use your grammar to show a derivation tree for “noon” and “bib.” 


(b) What type is your grammar (0, 1, 2 or 3)? 


(c) Write a grammar that generates all words that consist of a single letter followed 
by a palindrome. Use your grammar to show a derivation tree for “pi,” “too,” 
and “stat.” 


34. Consider the grammar G in Example 3. 
(a) How many possible derivations are there in G for numbers 1 through 999? 
(b) How many possible derivations are there for numbers 1 through 999,999? 


(c) Does the grammar allow any of the numbers (up to six digits) to be pronounced 
in more than one way? 


35. Recall that ε is the empty string, defined to have zero length, and no 
manifestation in a final string. Consider the following grammar G: A = {a}, S = S, 
T = {A, B, C, D, E} and eight rewrite rules: 


P = { S → ACaB    Ca → aaC 
      CB → DB     CB → E 
      aD → Da     AD → AC 
      aE → Ea     AE → ε } 


(a) Note how A and B mark the beginning and end of the sentence, respectively, 
and that C is a marker that doubles the number of as (while moving from left 
to right through the word). Prove that the language generated by this grammar 
is $L(G) = \{a^{2^n} \mid n > 0\}$. 


(b) Show a derivation tree for “aaaa” and for “aaaaaaaa” (cf. Computer 
exercise ??). 


36. Explore the notion of Chomsky normal form in the following way. 


(a) Show that the grammar G with A = {a,b}, S = S, T = {A, B} and rewrite 
rules: 


P = { S → bA OR aB 
      A → bAA OR aS OR a 
      B → aBB OR bS OR b } 


is not in Chomsky normal form. 

(b) Show that grammar G′ with A = {a,b}, S = S, T = {A, B, $C_a$, $C_b$, $D_1$, $D_2$}, 
and 


P = { S → $C_b$A OR $C_a$B                $D_1$ → AA 
      A → $C_a$S OR $C_b D_1$ OR a       $D_2$ → BB 
      B → $C_b$S OR $C_a D_2$ OR b       $C_a$ → a 
                                      $C_b$ → b } 


is in Chomsky normal form. 


(c) Show that G and G′ are equivalent by converting the rewrite rules of G into 
those of G′ in the following way. Note that the rules A → a and B → b of G 
are already acceptable. Now convert other rules of G appropriate for Chomsky 
normal form. First replace S → bA in G by S → $C_b$A and $C_b$ → b. Likewise, 
replace A → aS by A → $C_a$S and $C_a$ → a. Continue in this way, keeping in 
mind the final form of the rewrite rules of G′. 


(d) Give a derivation of “aabab” in G and in G′. 

37. Prove that each of the following languages is not context-free. 

(a) $L(G) = \{a^i b^j c^k \mid i < j < k\}$. 

(b) $L(G) = \{a^i \mid i \text{ a prime}\}$. 

38. Consider a grammar with A = {a,b,c}, S = S, T = {A, B}, and 


P = { S → aSBA OR aBA    AB → BA 
      bB → bb            bA → bc 
      cA → cc            aB → ab } 


Prove that this grammar generates the language $L(G) = \{a^n b^n c^n \mid n \ge 1\}$. 

39. Try to parse by hand the following utterances. For each successful parse, show 
the corresponding derivation tree. 


• three hundred forty two thousand six hundred nineteen 
• thirteen 
• nine hundred thousand 
• two thousand six hundred thousand five 
• one hundred sixty eleven 

40. Let $D_1$ = {ab, abb, abbb} and $D_2$ = {ba, aba, babb} be positive training examples 
from two grammars, $G_1$ and $G_2$, respectively. 


(a) Suppose both grammars are of type 3. Generate some candidate rewrite rules. 

(b) Infer grammar $G_1$ using $D_2$ as negative examples. 

(c) Infer grammar $G_2$ using $D_1$ as negative examples. 

(d) Use your trained grammars to classify the following sentences; label any sentence 
that cannot be parsed in either grammar as “ambiguous”: “bba,” “abab,” “bbb” and 
“abbbb.” 

41. For each of the below, write a rule giving an equivalent relation using any of 
the following predicates: Male(·), Female(·), Parent(·,·), Married(·,·). 


(a) Sister(·,·), where Sister(x,y) = True means that x is the sister of y. 

(b) Father(·,·), where Father(x,y) = True means that x is the father of y. 

(c) Grandmother(·,·), where Grandmother(x,y) = True means that x is the 
grandmother of y. 

(d) Husband(·,·), where Husband(x,y) = True means that x is the husband of y. 

(e) IsWife(·), where IsWife(x) = True means simply that x is a wife. 

(f) Siblings(·,·) 

(g) FirstCousins(·,·) 


Computer exercises 


Several exercises will make use of the following data sampled from three categories. 
Each of the five features takes on a discrete value, indicated by the range listed 
along the top. Note particularly that there are different numbers of samples in 
each category, and that the number of possible values for the features is not the same. 
For instance, the first feature can take on four values (A to D, inclusive), while the 
last feature can take on just two (M and N). 


sample | category   | A-D | E-G | H-J | K-L | M-N 
1      | $\omega_1$ | A   | E   | H   | K   | M 
2      | $\omega_1$ | B   | E   | I   | L   | M 
3      | $\omega_1$ | A   | G   | I   | L   | N 
4      | $\omega_1$ | B   | G   | H   | K   | M 
5      | $\omega_1$ | A   | G   | I   | L   | M 
6      | $\omega_2$ | B   | F   | I   | L   | M 
7      | $\omega_2$ | B   | F   | J   | L   | N 
8      | $\omega_2$ | B   | E   | I   | L   | N 
9      | $\omega_2$ | C   | G   | J   | K   | N 
10     | $\omega_2$ | C   | G   | J   | L   | M 
11     | $\omega_2$ | D   | G   | J   | K   | M 
12     | $\omega_2$ | B   | D   | I   | L   | M 
13     | $\omega_3$ | D   | E   | H   | K   | N 
14     | $\omega_3$ | A   | E   | H   | K   | N 
15     | $\omega_3$ | D   | E   | H   | L   | N 
16     | $\omega_3$ | D   | F   | J   | L   | N 
17     | $\omega_3$ | A   | F   | H   | K   | N 
18     | $\omega_3$ | D   | E   | J   | L   | M 

1. Write a general program for growing a binary tree and use it to train a tree fully 
using the data from the three categories in the table, using an entropy impurity. 


(a) Use the (unpruned) tree to classify the following patterns: {A,E,I,L,N}, 
La, E, J, K, N}, 18, F, J, K, M}, and {C, D, J, L, N}. 


(b) Prune one pair of leafs, increasing the entropy impurity as little as possible. 


(c) Modify your program to allow for non-binary splits, where the branching ratio 
B is determined at each node during training. Train a new tree fully using a 
gain ratio impurity and then classify the points in (a). 


2. Recall that one criterion for stopping the growing of a decision tree is to halt 
splitting when the best split reduces the impurity by less than some threshold value, 
that is, when $\max_s \Delta i(s) < \beta$ where s indicates the split and $\beta$ is the threshold. 
Explore the relationship between classifier generalization and $\beta$ through the following 
simulations. 


(a) Generate 200 training points, 100 each for two two-dimensional Gaussian dis- 


tributions: p(x|w1) ~ MES OD and p(x|w2) ~ NM D. Also use your 


program to generate an independent test set of 100 points, 50 each of the cate- 
gories. 


(b) Write a program to grow a tree classifier, where a node is not split if 
$\max_s \Delta i(s) < \beta$. 


(c) Plot the generalization error of your tree classifier versus $\beta$ for $\beta$ = 0.01 to 1 in 
steps of 0.01, as estimated on the test data generated in part (a). 


(d) In light of your plot, discuss the relationship between $\beta$ and generalization error. 


3. Repeat all parts of Computer exercise 2, but instead of considering $\beta$, focus 
on the role of $\alpha$ as used in Eq. 8. 

4. Write a program for training an ID3 decision tree in which the branching ratio 
B at each node is equal to the number of discrete “binned” possible values for each 
attribute. Use a gain ratio impurity. 


(a) Use your program to train a tree fully with the $\omega_1$ and $\omega_2$ patterns in the table 
above. 


(b) Use your tree to classify {B, G, I, K, N} and {C, D, J, L, M}. 


(c) Write a logical expression which describes the decisions in part (b). Simplify 
these expressions. 


(d) Convert the information in your tree into a single logical expression which 
describes the $\omega_1$ category. Repeat for the $\omega_2$ category. 


5. Consider the issue of tree-based classifiers and deficient patterns. 

(a) Write a program to generate a binary decision tree for categories $\omega_1$ and $\omega_2$ 
using sample points 1 to 10 from the table above and an entropy impurity. For 
each decision node store the primary split and four surrogate splits. 


(b) Use your tree to classify the following patterns, where as usual * denotes a 
missing feature. 
• {A, F, H, K, M} 
• {*, G, H, K, M} 
• {C, F, I, L, N} 
• {B, *, *, K, N} 
(c) Now write a program to train a tree using deficient points. Train with sample 
points 1 — 10 from the table, used in part (a), as well as the following four points: 
e w: {*, FI, k,N} 
e w: {B,G,H, K, x} 
e w: {0,G,*x, L, N} 
e wo: {x,F,I, K, N} 


(d) Use your tree from part (c) to classify the test points in part (b). 


6. Train a tree classifier to distinguish all three categories $\omega_i$, i = 1, 2, 3, using all 
20 sample points in the table above. Use an entropy criterion without pruning or 
stopping. 


(a) Express your tree as a set of rules. 


(b) Through exhaustive search, find the rule or rules, which when deleted, lead to 
the smallest increase in classification error as estimated by the training data. 

7. Write a program to implement the naive string-matching algorithm (Algorithm 1). 

Insert a conditional branch so as to exit the innermost loop whenever a mismatch 
occurs (i.e., the shift is found to be invalid). Add a line to count the total number of 
character-to-character comparisons in the complete string search. 


(a) Write a small program to generate a text of n characters, taken from an alphabet 
having d characters. Let d = 5 and use your program to generate a text of length 
n = 1000 and a test string x of length m = 10. 


(b) Compare the number of character-to-character comparisons performed by your 
program with the theoretical result quoted in Problem 18 for all pairs of the 
following parameters: m = {10, 15, 20} and n = {100, 1000, 10000}. 


8. Write a program to implement the Boyer-Moore algorithm (Algorithm 2) in the 
following steps. Throughout let the alphabet have d characters. 


(a) Write a routine for constructing the good-suffix function G. Let d = 3 and apply 
your routine to the strings $x_1$ = “abcbab” and $x_2$ = “babab.” 


(b) Write a routine for constructing the last-occurrence function F. Apply it to the 
strings $x_1$ and $x_2$ of part (a). 


(c) Write an implementation of the Boyer-Moore algorithm incorporating your 
routines from parts (a) and (b). Generate a text of n = 10000 characters chosen 
from the alphabet A = {a,b,c}. Use your program to search for $x_1$ in text, and 
again $x_2$ in text. 


(d) Make some statistical assumptions to estimate the number of occurrences of $x_1$ 
and $x_2$ in text, and compare that number to your answers in part (c). 


9. Write an algorithm for addressing the subset-superset problem in string matching. 
That is, search a text with several strings, some of which are factors of others. 


(a) Let $x_1$ = “beats,” $x_2$ = “beat,” $x_3$ = “be,” $x_4$ = “at,” $x_5$ = “eat,” $x_6$ = “sat.” 
Search for these possible factors in text = “beats_beats_beats_..._beats” (100 
copies of “beats”), but do not return any strings that are factors of other test 
strings found in text. 


(b) Repeat with text consisting of 100 appended copies of “repeatable_,” and the 
test items “repeatable,” “pea,” “table,” “tab,” “able,” “peat,” and “a.” 


10. String matching with errors. Test on segments xxxx 

11. Write a parser for the grammar described in the text: A = {a,b}, T = {A, B, C}, 
S = S and 

P = { p1: S → AB OR BC 
      p2: A → BA OR a 
      p3: B → CC OR b 
      p4: C → AB OR a } 

Use your program to attempt to parse each of the following strings. In all cases, 
show the parse tree; for each successful parse show moreover the corresponding 
derivation tree. 


• “aaaabbab” 
• “ba” 
• “baabab” 
• “babab” 
• “aaa” 
• “baaa” 
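
As a starting point for Computer exercise 11, the following is a minimal sketch of a bottom-up membership test in the style of the CYK table-filling method. It assumes the rewrite rules read S → AB OR BC, A → BA OR a, B → CC OR b and C → AB OR a, and it only reports whether a string can be derived; it does not build the parse or derivation trees the exercise asks for. The names `UNARY`, `BINARY` and `cyk_accepts` are hypothetical.

```python
from itertools import product

# Assumed rewrite rules, indexed by right-hand side.
UNARY = {"a": {"A", "C"}, "b": {"B"}}
BINARY = {
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}

def cyk_accepts(s, start="S"):
    """Return True if the string s can be derived from the start symbol."""
    n = len(s)
    if n == 0:
        return False
    # table[i][l - 1] holds the symbols deriving the substring s[i:i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][0] = set(UNARY.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                # Any rule X -> YZ with Y in left and Z in right adds X.
                for pair in product(left, right):
                    table[i][length - 1] |= BINARY.get(pair, set())
    return start in table[0][n - 1]
```

Storing, for each table entry, which rule and split point produced each symbol would be one way to recover a derivation tree from a successful parse.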

12. Write a program to infer a grammar G from the following positive and negative 
examples: 


• $D^+$ = {abc, aabbcc, aaabbbccc} 

• $D^-$ = {abbc, abcc, aabcc} 

Take the following as candidate rewrite rules: 


S → aSBA    AB → BA    cB → aC 
S → bSBA    BA → AB    bA → bc 
S → aBA     bB → bb    bC → bc 
S → aSB     bC → ba    aB → ab 
S → aSA     cA → cc    aB → ca 


Proceed as follows: 


(a) Implement the general bottom-up parser of Algorithm 4. 

(b) Implement the general grammatical inferencing method of Algorithm 5. 

(c) Use your programs in conjunction to infer G from the data. 


Chapter 9 


Algorithm-independent 
machine learning 


9.1 Introduction 


In the previous chapters we have seen many learning algorithms and techniques for 
pattern recognition. When confronting such a range of algorithms, every reader 
has wondered at one time or another which one is “best.” Of course, some algorithms 
may be preferred because of their lower computational complexity; others may be 
preferred because they take into account some prior knowledge of the form of the 
data (e.g., discrete, continuous, unordered list, string, ...). Nevertheless there are 
classification problems for which such issues are of little or no concern, or we wish 
to compare algorithms that are equivalent in regard to them. In these cases we are 
left with the question: Are there any reasons to favor one algorithm over another? 
For instance, given two classifiers that perform equally well on the training set, it 
is frequently asserted that the simpler classifier can be expected to perform better 
on a test set. But is this version of Occam’s razor really so evident? Likewise, 
we frequently prefer or impose smoothness on a classifier’s decision functions. Do 
simpler or “smoother” classifiers generalize better, and if so, why? In this chapter 
we address these and related questions concerning the foundations and philosophical 
underpinnings of statistical pattern classification. Now that the reader has intuition 
and experience with individual algorithms, these issues in the theory of learning may 
be better understood. 

In some fields there are strict conservation laws and constraint laws — such as the 
conservation of energy, charge and momentum in physics, or the second law of ther- 
modynamics, which states that the entropy of an isolated system can never decrease. 
These hold regardless of the number and configuration of the forces at play. Given 
the usefulness of such laws, we naturally ask: are there analogous results in pattern 
recognition, ones that do not depend upon the particular choice of classifier or learn- 
ing method? Are there any fundamental results that hold regardless of the cleverness 
of the designer, the number and distribution of the patterns, and the nature of the 
classification task? 

Of course it is very valuable to know that there exists a constraint on classifier 
accuracy, the Bayes limit, and it is sometimes useful to compare performance to this 
theoretical limit. Alas, in practice we rarely if ever know the Bayes error rate. Even if 
we did know this error rate, it would not help us much in designing a classifier; thus 
the Bayes error is generally of theoretical interest. What other fundamental principles 
and properties might be of greater use in designing classifiers? 

Before we address such problems, we should clarify the meaning of the title of this 
chapter. “Algorithm-independent” here refers, first, to those mathematical founda- 
tions that do not depend upon the particular classifier or learning algorithm used. Our 
upcoming discussion of bias and variance is just as valid for methods based on neural 
networks as for the nearest-neighbor or for model-dependent maximum likelihood. 
Second, we mean techniques that can be used in conjunction with different learning 
algorithms, or provide guidance in their use. For example, cross validation and re- 
sampling techniques can be used with any of a large number of training methods. Of 
course by the very general notion of an algorithm these too are algorithms, technically 
speaking, but we discuss them in this chapter because of their breadth of applicability 
and independence from the details of the learning techniques encountered up to here. 

In this chapter we shall see, first, that no pattern classification method is inher- 
ently superior to any other, or even to random guessing; it is the type of problem, 
prior distribution and other information that determine which form of classifier should 
provide the best performance. We shall then explore several ways to quantify and ad- 
just the “match” between a learning algorithm and the problem it addresses. In any 
particular problem there are differences between classifiers, of course, and thus we 
show that with certain assumptions we can estimate their accuracy (even for instance 
before the candidate classifier is fully trained) and compare different classifiers. Fi- 
nally, we shall see methods for integrating component or “expert” classifiers, which 
themselves might implement any of a number of algorithms. 

We shall present the results that are most important for pattern recognition prac- 
titioners, occasionally skipping over mathematical details that can be found in the 
original research referenced in the Bibliographical and Historical Remarks section. 


9.2 Lack of inherent superiority of any classifier 


We now turn to the central question posed above: If we are interested solely in the 
generalization performance, are there any reasons to prefer one classifier or learning 
algorithm over another? If we make no prior assumptions about the nature of the 
classification task, can we expect any classification method to be superior or inferior 
overall? Can we even find an algorithm that is overall superior to (or inferior to) 
random guessing? 


9.2.1 No Free Lunch Theorem 


As summarized in the No Free Lunch Theorem, the answer to these and several 
related questions is no: on the criterion of generalization performance, there are 
no context- or problem-independent reasons to favor one learning or classification 
method over another. The apparent superiority of one algorithm or set of algorithms 
is due to the nature of the problems investigated and the distribution of data. It 
is an appreciation of the No Free Lunch Theorem that allows us, when confronting 
practical pattern recognition problems, to focus on the aspects that matter most 
— prior information, data distribution, amount of training data and cost or reward 
functions. The Theorem also justifies a scepticism about studies that purport to 
demonstrate the overall superiority of a particular learning or recognition algorithm. 

When comparing algorithms we sometimes focus on generalization error for points 
not in the training set D, rather than the more traditional independent identically 
distributed or i.i.d. case. We do this for several reasons: First, virtually any powerful 
algorithm such as the nearest-neighbor algorithm, unpruned decision trees, or neural 
networks with a sufficient number of hidden nodes, can learn the training set. Second, 
for low-noise or low-Bayes error cases, if we use an algorithm powerful enough to learn 
the training set, then the upper limit of the i.i.d. error decreases as the training set 
size increases. In short, it is the off-training set error — the error on points not in 
the training set — that is a better measure for distinguishing algorithms. Of course, 
for most applications the final performance of a fielded classifier is the full i.i.d. error. 

For simplicity consider a two-category problem, where the training set D consists 
of patterns $x^i$ and associated category labels $y_i = \pm 1$ for $i = 1, \ldots, n$ generated by 
the unknown target function to be learned, F(x), where $y_i = F(x^i)$. In most cases 
of interest there is a random component in F(x) and thus the same input could lead 
to different categories, giving non-zero Bayes error. At first we shall assume that the 
feature set is discrete; this simplifies notation and allows the use of summation and 
probabilities rather than integration and probability densities. The general conclu- 
sions hold in the continuous case as well, but the required technical details would 
cloud our discussion. 

Let H denote the (discrete) set of hypotheses, or possible sets of parameters to be 
learned. A particular hypothesis $h(x) \in H$ could be described by quantized weights in 
a neural network, or parameters $\theta$ in a functional model, or sets of decisions in a tree, 
etc. Further, P(h) is the prior probability that the algorithm will produce hypothesis 
h after training; note that this is not the probability that h is correct. Next, P(h|D) 
denotes the probability that the algorithm will yield hypothesis h when trained on 
the data D. In deterministic learning algorithms such as the nearest-neighbor and 
decision trees, P(h|D) will be everywhere zero except for a single hypothesis h. For 
stochastic methods, such as neural networks trained from random initial weights, or 
stochastic Boltzmann learning, P(h|D) will be a broad distribution. For a general loss 
function $L(\cdot,\cdot)$ we let $E = L(\cdot,\cdot)$ be the scalar error or cost. While the natural loss 
function for regression is a sum-square error, for classification we focus on zero-one 
loss, and thus the generalization error is the expected value of E. 

How shall we judge the generalization quality of a learning algorithm? Since we 
are not given the target function, the natural measure is the expected value of the 
error given D, summed over all possible targets. This scalar value can be expressed as 
a weighted “inner product” between the distributions P(h|D) and P(F|D), as follows: 

$$\mathcal{E}[E|D] = \sum_{h,F} \sum_{x \notin D} P(x)\,[1 - \delta(F(x), h(x))]\,P(h|D)\,P(F|D), \qquad (1)$$


where for the moment we assume there is no noise. The familiar Kronecker delta function, 
$\delta(\cdot,\cdot)$, has value 1 if its two arguments match, and value 0 otherwise. Equation 1 
states that the expected error rate, given a fixed training set D, is related to the sum 
over all possible inputs weighted by their probabilities, P(x), as well as the “alignment” 
or “match” of the learning algorithm, P(h|D), to the actual posterior P(F|D). 
The important insight provided by this equation is that without prior knowledge con- 
cerning P(F|D), we can prove little about any particular learning algorithm P(h|D), 
including its generalization performance. 


The expected off-training set classification error when the true function is F(x) 
and some candidate learning algorithm is $P_k(h(x)|D)$ is given by 

$$\mathcal{E}_k(E|F,n) = \sum_{x \notin D} P(x)\,[1 - \delta(F(x), h(x))]\,P_k(h(x)|D). \qquad (2)$$

With this background and the terminology of Eq. 2 we can now turn to a formal 
statement of the No Free Lunch Theorem. 


Theorem 9.1 (No Free Lunch) For any two learning algorithms $P_1(h|D)$ and $P_2(h|D)$, 
the following are true, independent of the sampling distribution P(x) and the number 
n of training points: 


1. Uniformly averaged over all target functions F, $\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$; 


2. For any fixed training set D, uniformly averaged over F, $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$; 


3. Uniformly averaged over all priors P(F), $\mathcal{E}_1(E|n) - \mathcal{E}_2(E|n) = 0$; 


4. For any fixed training set D, uniformly averaged over P(F), $\mathcal{E}_1(E|D) - \mathcal{E}_2(E|D) = 0$.* 


Part 1 says that uniformly averaged over all target functions the expected error 
for all learning algorithms is the same, i.e., 


$$\sum_{F} \sum_{D} P(D|F)\,[\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n)] = 0, \qquad (3)$$


for any two learning algorithms. In short, no matter how clever we are at choosing 
a “good” learning algorithm $P_1(h|D)$, and a “bad” algorithm $P_2(h|D)$ (perhaps even 
random guessing, or a constant output), if all target functions are equally likely, then 
the “good” algorithm will not outperform the “bad” one. Stated more generally, 
there are no i and j such that for all F(x), $\mathcal{E}_i(E|F,n) > \mathcal{E}_j(E|F,n)$. Furthermore, no 
matter what algorithm you use, there is at least one target function for which random 
guessing is a better algorithm. 

Assuming the training set can be learned by all algorithms in question, then Part 2 
states that even if we know D, then averaged over all target functions no learning 
algorithm yields an off-training set error that is superior to any other, i.e., 

$$\sum_{F} [\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D)] = 0. \qquad (4)$$

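A brief sketch may help show why a result of this form holds. It covers only a special case that we assume for illustration: a deterministic algorithm k that returns a single hypothesis $h_k$, binary targets, zero-one loss and no noise. For any fixed $x \notin D$, exactly half of the binary target functions satisfy $F(x) = h_k(x)$, whatever the value of $h_k(x)$, so the uniform average over the set $\mathcal{F}$ of targets is

$$\frac{1}{|\mathcal{F}|} \sum_{F} \mathcal{E}_k(E|F,D) = \sum_{x \notin D} P(x)\,\frac{1}{|\mathcal{F}|} \sum_{F} [1 - \delta(F(x), h_k(x))] = \frac{1}{2} \sum_{x \notin D} P(x).$$

The right-hand side does not involve k, so the difference between any two such algorithms averages to zero. The same counting applies whether $\mathcal{F}$ contains all binary targets or only those consistent with D.
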
Parts 3 & 4 concern non-uniform target function distributions, and have related in- 
terpretations (Problems 2 — 5). Example 1 provides an elementary illustration. 


Example 1: No Free Lunch for binary data 


Consider input vectors consisting of three binary features, and a particular target 
function F(x), as given in the table. Suppose (deterministic) learning algorithm 1 
assumes every pattern is in category $\omega_1$ unless trained otherwise, and algorithm 2 
assumes every pattern is in $\omega_2$ unless trained otherwise. Thus when trained with 
n = 3 points in D, each algorithm returns a single hypothesis, $h_1$ and $h_2$, respectively. 
In this case the expected errors on the off-training set data are $\mathcal{E}_1(E|F,D) = 0.4$ and 
$\mathcal{E}_2(E|F,D) = 0.6$. 


* The clever name for the Theorem was suggested by David Haussler. 


 x    |  F    h_1   h_2 
------+----------------- 
 000  |  1     1     1      (in D) 
 001  | -1    -1    -1      (in D) 
 010  |  1     1     1      (in D) 
------+----------------- 
 011  | -1     1    -1 
 100  |  1     1    -1 
 101  | -1     1    -1 
 110  |  1     1    -1 
 111  |  1     1    -1 


For this target function F(x), clearly algorithm 1 is superior to algorithm 2. But 
note that the designer does not know F(x) — indeed, we assume we have no prior 
information about F(x). The fact that all targets are equally likely means that D 
provides no information about F(x). If we wish to compare the algorithms overall, 
we therefore must average over all such possible target functions consistent with the 
training data. Part 2 of Theorem 9.1 states that averaged over all possible target 
functions, there is no difference in off-training set errors between the two algorithms. 
For each of the $2^5$ distinct target functions consistent with the n = 3 patterns in 
D, there is exactly one other target function whose output is inverted for each of the 
patterns outside the training set, and this ensures that the performances of algorithms 
1 and 2 will also be inverted, so that the contributions to the formula in Part 2 cancel. 
Thus indeed Part 2 of the Theorem as well as Eq. 4 are obeyed. 
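
The averaging argument of Example 1 can be checked by brute force. The sketch below is only an illustration under assumptions of our own: P(x) is taken as uniform over the five off-training-set patterns, the training labels for 000, 001 and 010 are taken to be +1, -1 and +1 (they do not affect the off-training-set error), and the names h1, h2 and off_training_error are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 3-bit input patterns, in the order used in the table of Example 1.
patterns = [''.join(bits) for bits in product('01', repeat=3)]

# Training set D (assumed labels; they play no role in off-training-set error).
training_labels = {'000': 1, '001': -1, '010': 1}
off_training = [x for x in patterns if x not in training_labels]

def h1(x):
    """Algorithm 1: memorize D, otherwise guess category +1."""
    return training_labels.get(x, 1)

def h2(x):
    """Algorithm 2: memorize D, otherwise guess category -1."""
    return training_labels.get(x, -1)

def off_training_error(h, target):
    """Fraction of off-training-set patterns misclassified (uniform P(x) assumed)."""
    wrong = sum(h(x) != target[x] for x in off_training)
    return Fraction(wrong, len(off_training))

# The particular target function of Example 1 on the off-training-set patterns.
example_target = {'011': -1, '100': 1, '101': -1, '110': 1, '111': 1}
e1 = off_training_error(h1, example_target)
e2 = off_training_error(h2, example_target)

# Every labeling of the five off-training-set patterns is a target consistent with D.
targets = [dict(zip(off_training, labels))
           for labels in product([1, -1], repeat=len(off_training))]
mean1 = sum(off_training_error(h1, t) for t in targets) / len(targets)
mean2 = sum(off_training_error(h2, t) for t in targets) / len(targets)

print(e1, e2)        # errors for the single target of Example 1
print(mean1, mean2)  # averages over all 2**5 consistent targets
```

Under these assumptions the two errors for the target of Example 1 are 2/5 and 3/5, while both averages equal 1/2, in agreement with Part 2 of the Theorem.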


Figure 9.1 illustrates a result derivable from Part 1 of Theorem 9.1. Each of the 
six squares represents the set of all possible classification problems; note that this is 
not the standard feature space. If a learning system performs well — higher than 
average generalization accuracy — over some set of problems, then it must perform 
worse than average elsewhere, as shown in a). No system can perform well throughout 
the full set of functions, d); to do so would violate the No Free Lunch Theorem. 

In sum, all statements of the form “learning/recognition algorithm 1 is better than 
algorithm 2” are ultimately statements about the relevant target functions. There 
is, hence, a “conservation theorem” in generalization: for every possible learning 
algorithm for binary classification the sum of performance over all possible target 
functions is exactly zero. Thus we cannot achieve positive performance on some 
problems without getting an equal and opposite amount of negative performance on 
other problems. While we may hope that we never have to apply any particular 
algorithm to certain problems, all we can do is trade performance on problems we do 
not expect to encounter with those that we do expect to encounter. This, and the 
other results from the No Free Lunch Theorem, stress that it is the assumptions about 
the learning domains that are relevant. Another practical import of the Theorem is 
that even popular and theoretically grounded algorithms will perform poorly on some 
problems, ones in which the learning algorithm and the posterior happen not to be 
“matched,” as governed by Eq. 1. Practitioners must be aware of this possibility, which 
arises in real-world applications. Expertise limited to a small range of methods, even 
powerful ones such as neural networks, will not suffice for all classification problems. 
[Figure 9.1 panel labels: possible learning systems (top row); impossible learning systems (bottom row); each square is a problem space, not a feature space.] 


Figure 9.1: The No Free Lunch Theorem shows the generalization performance on 
the off-training set data that can be achieved (top row), and the performance that 
cannot be achieved (bottom row). Each square represents all possible classification 
problems consistent with the training data — this is not the familiar feature space. 
A + indicates that the classification algorithm has generalization higher than average, 
a - indicates lower than average, and a 0 indicates average performance. The size of 
a symbol indicates the amount by which the performance differs from the average. 
For instance, a) shows that it is possible for an algorithm to have high accuracy on a 
small set of problems so long as it has mildly poor performance on all other problems. 
Likewise, b) shows that it is possible to have excellent performance throughout a large 
range of problems but this will be balanced by very poor performance on a large range 
of other problems. It is impossible, however, to have good performance throughout 
the full range of problems, shown in d). It is also impossible to have higher than 
average performance on some problems, and average performance everywhere else, 
shown in e). 


Experience with a broad range of techniques is the best insurance for solving arbitrary 
new classification problems. 


9.2.2 *Ugly Duckling Theorem 


While the No Free Lunch Theorem shows that in the absence of assumptions we should 
not prefer any learning or classification algorithm over another, an analogous theorem 
addresses features and patterns. Roughly speaking, the Ugly Duckling Theorem states 
that in the absence of assumptions there is no privileged or “best” feature represen- 
tation, and that even the notion of similarity between patterns depends implicitly on 
assumptions which may or may not be correct. 

Since we are using discrete representations, we can use logical expressions or 
“predicates” to describe a pattern, much as in Chap. ??. If we denote a binary 
feature attribute by $f_i$, then a particular pattern might be described by the predicate 
“$f_1$ AND $f_2$,” another pattern might be described as “NOT $f_2$,” and so on. Likewise 
we could have a predicate involving the patterns themselves, such as $x_1$ OR $x_2$. 
Figure 9.2 shows how patterns can be represented in a Venn diagram. 

Below we shall need to count predicates, and for clarity it helps to consider a 
particular Venn diagram, such as that in Fig. 9.3. This is the most general Venn 
diagram based on two features, since for every configuration of $f_1$ and $f_2$ there is 


Figure 9.2: Patterns $x_i$, represented as d-tuples of binary features $f_i$, can be placed 
in a Venn diagram (here d = 3); the diagram itself depends upon the classification 
problem and its constraints. For instance, suppose $f_1$ is the binary feature attribute 
has_legs, $f_2$ is has_right_arm and $f_3$ the attribute has_right_hand. Thus in part a) 
pattern $x_1$ denotes a person who has legs but neither arm nor hand; $x_2$ a person who 
has legs and an arm, but no hand; and so on. Notice that the Venn diagram expresses 
the biological constraints associated with real people: it is impossible for someone to 
have a right hand but no right arm. Part c) expresses different constraints, such as 
the biological constraint of mutually exclusive eye colors. Thus attributes $f_1$, $f_2$ and 
$f_3$ might denote brown, green and blue respectively and a pattern $x_i$ describes a real 
person, whom we can assume cannot have eyes that differ in color. 


indeed a pattern. Here predicates can be as simple as “$x_4$,” or more complicated, 
such as “$x_1$ OR $x_2$ OR $x_4$,” and so on. 


Figure 9.3: The Venn diagram for a problem with no constraints on two features. Thus all 
four binary attribute vectors can occur. 


The rank r of a predicate is the number of the simplest or indivisible elements it 
contains. The tables below show the predicates of rank 1, 2 and 3 associated with the 
Venn diagram of Fig. 9.3.* Not shown is the fact that there is but one predicate of 
rank r = 4, the disjunction of the $x_1, \ldots, x_4$, which has the logical value True. If we 
let n be the total number of regions in the Venn diagram (i.e., the number of distinct 
possible patterns), then there are $\binom{n}{r}$ predicates of rank r, as shown at the bottom of 
the table. 


* Technically speaking, we should use set operations rather than logical operations when discussing 
the Venn diagram, writing $x_1 \cup x_2$ instead of $x_1$ OR $x_2$. Nevertheless we use logical operations 
here for consistency with the rest of the text. 


rank r = 1                 | rank r = 2                      | rank r = 3 
x1   f1 AND NOT f2         | x1 OR x2   f1                   | x1 OR x2 OR x3   f1 OR f2 
x2   f1 AND f2             | x1 OR x3   f1 XOR f2            | x1 OR x2 OR x4   f1 OR NOT f2 
x3   f2 AND NOT f1         | x1 OR x4   NOT f2               | x1 OR x3 OR x4   NOT(f1 AND f2) 
x4   NOT(f1 OR f2)         | x2 OR x3   f2                   | x2 OR x3 OR x4   f2 OR NOT f1 
                           | x2 OR x4   NOT(f1 XOR f2)       | 
                           | x3 OR x4   NOT f1               | 
$\binom{4}{1} = 4$         | $\binom{4}{2} = 6$              | $\binom{4}{3} = 4$ 


The total number of predicates in the absence of constraints is 


$$\sum_{r=0}^{n} \binom{n}{r} = (1+1)^n = 2^n, \qquad (5)$$


and thus for the n = 4 case of Fig. 9.3, there are $2^4 = 16$ possible predicates (Problem 9). 
Note that Eq. 5 applies only to the case where there are no constraints; for 
Venn diagrams that do incorporate constraints, such as those in Fig. 9.2, the formula 
does not hold (Problem 10). 
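
The count in Eq. 5 can be reproduced by enumeration. The following is a minimal sketch, assuming that each predicate is identified with the set of patterns it covers and that the rank-0 term of the sum is the empty predicate (always False); the variable names are ours.

```python
from itertools import combinations
from math import comb

n = 4  # number of regions (distinct patterns) in the unconstrained two-feature diagram
patterns = [f"x{i}" for i in range(1, n + 1)]

# A predicate of rank r is represented by the set of r patterns it covers.
predicates_by_rank = {
    r: [frozenset(c) for c in combinations(patterns, r)]
    for r in range(n + 1)
}

for r, preds in predicates_by_rank.items():
    # Each rank should contain "n choose r" predicates.
    assert len(preds) == comb(n, r)
    print(f"rank {r}: {len(preds)} predicates")

total = sum(len(preds) for preds in predicates_by_rank.values())
print("total:", total)
```

For n = 4 the ranks contribute 1, 4, 6, 4 and 1 predicates, for a total of 16.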

Now we turn to our central question: In the absence of prior information, is there 
a principled reason to judge any two distinct patterns as more or less similar than two 
other distinct patterns? A natural and familiar measure of similarity is the number 
of features or attributes shared by two patterns, but even such an obvious measure 
presents conceptual difficulties. 

To appreciate such difficulties, consider first a simple example. Suppose attributes 
$f_1$ and $f_2$ represent blind_in_right_eye and blind_in_left_eye, respectively. If we 
base similarity on shared features, person $x_1 = (1, 0)$ (blind only in the right eye) is 
maximally different from person $x_2 = (0, 1)$ (blind only in the left eye). In particular, 
in this scheme $x_1$ is more similar to a totally blind person and to a normally sighted 
person than he is to x2. But this result may prove unsatisfactory; we can easily 
envision many circumstances where we would consider a person blind in just the right 
eye to be “similar” to one blind in just the left eye. Such people might be permitted 
to drive automobiles, for instance. Further, a person blind in just one eye would differ 
significantly from a totally blind person who would not be able to drive. 

A second, related point is that there are always multiple ways to represent vectors 
(or tuples) of attributes. For instance, in the above example, we might use alternative 
features $f'_1$ and $f'_2$ to represent blind_in_right_eye and same_in_both_eyes, 
respectively, and then the four types of people would be represented as shown in the 
tables. 


       f_1   f_2  |  f'_1  f'_2 
x_1     0     0   |   0     1 
x_2     0     1   |   0     0 
x_3     1     0   |   1     0 
x_4     1     1   |   1     1 


Of course there are other representations, each more or less appropriate to the par- 
ticular problem at hand. In the absence of prior information, though, there is no 
principled reason to prefer one of these representations over another. 
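
The dependence of feature-based similarity on the representation can be made concrete. The sketch below is an illustration only; the dictionary keys, the helper names reencode and shared, and the choice to measure similarity as the number of agreeing attributes are our assumptions.

```python
# Four people described by (blind_in_right_eye, blind_in_left_eye).
original = {
    "sighted":          (0, 0),
    "blind_left_only":  (0, 1),
    "blind_right_only": (1, 0),
    "totally_blind":    (1, 1),
}

def reencode(v):
    """Map (right, left) to (blind_in_right_eye, same_in_both_eyes)."""
    right, left = v
    return (right, int(right == left))

alternative = {name: reencode(v) for name, v in original.items()}

def shared(a, b):
    """Number of feature attributes on which two patterns agree."""
    return sum(ai == bi for ai, bi in zip(a, b))

a, b, c = "blind_right_only", "blind_left_only", "totally_blind"

# Original features: right-only vs left-only, then right-only vs totally blind.
print(shared(original[a], original[b]), shared(original[a], original[c]))
# Alternative features: the same two comparisons.
print(shared(alternative[a], alternative[b]), shared(alternative[a], alternative[c]))
```

With the original features the two one-eyed people share no attributes, whereas with the alternative features the person blind in the right eye shares one attribute with each of the other two. The ranking of similarities thus changes with the encoding alone.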
We must then still confront the problem of finding a principled measure of the similarity 
between two patterns, given some representation. The only plausible candidate 
measure in this circumstance would be the number of predicates (rather than the 
number of features) the patterns share. Consider two distinct patterns (in some representation) 
$x_i$ and $x_j$, where $i \neq j$. Regardless of the constraints in the problem 
(i.e., the Venn diagram), there are, of course, no predicates of rank r = 1 that are 
shared by the two patterns. There is but one predicate of rank r = 2, i.e., $x_i$ OR $x_j$. 
A predicate of rank r = 3 must contain three patterns, two of which are $x_i$ and $x_j$. 
Since there are d patterns total (the number called n in Eq. 5), there are then $\binom{d-2}{1} = d - 2$ predicates of rank 3 that 
are shared by $x_i$ and $x_j$. Likewise, for an arbitrary rank r, there are $\binom{d-2}{r-2}$ predicates 
shared by the two patterns, where $2 \le r \le d$. The total number of predicates shared 
by the two patterns is thus the sum 


$$\sum_{r=2}^{d} \binom{d-2}{r-2} = (1+1)^{d-2} = 2^{d-2}. \qquad (6)$$


Note the key result: Eq. 6 is independent of the choice of $x_i$ and $x_j$ (so long as they 
are distinct). Thus we conclude that the number of predicates shared by two distinct 
patterns is constant, and independent of the patterns themselves (Problem 11). We 
conclude that if we judge similarity based on the number of predicates that patterns 
share, then any two distinct patterns are “equally similar.” This is stated formally 
as: 


Theorem 9.2 (Ugly Duckling) Given that we use a finite set of predicates that en- 
ables us to distinguish any two patterns under consideration, the number of predicates 
shared by any two such patterns is constant and independent of the choice of those 
patterns. Furthermore, if pattern similarity is based on the total number of predicates 
shared by two patterns, then any two patterns are “equally similar.” * 
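
The constancy asserted by the Theorem can be verified directly for a small case. The following sketch assumes no constraints among the patterns, identifies each predicate with the nonempty set of patterns it covers, and uses a hypothetical value d = 5 for the number of patterns.

```python
from itertools import combinations

def shared_predicate_count(d, i, j):
    """Count predicates (nonempty sets of patterns) that contain both pattern i and pattern j."""
    count = 0
    for r in range(1, d + 1):
        for predicate in combinations(range(d), r):
            if i in predicate and j in predicate:
                count += 1
    return count

d = 5  # total number of distinct patterns (hypothetical value for illustration)
counts = {(i, j): shared_predicate_count(d, i, j)
          for i, j in combinations(range(d), 2)}

# Every pair of distinct patterns yields the same count, namely 2 ** (d - 2).
print(set(counts.values()))
```

Every one of the ten pairs shares the same number of predicates, 8 in this case, so a similarity measure based on shared predicates cannot distinguish one pair from another.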


In summary, then, the Ugly Duckling Theorem states something quite simple yet 
important: there is no problem-independent or privileged or “best” set of features or 
feature attributes. Moreover, while the above was derived using d-tuples of binary 
values, it also applies to continuous feature spaces, if such a space is discretized 
(at any resolution). The Theorem forces us to acknowledge that even the appar- 
ently simple notion of similarity between patterns is fundamentally based on implicit 
assumptions about the problem domain (Problem 12). 


9.2.3 Minimum description length (MDL) 


It is sometimes claimed that the minimum description length principle provides jus- 
tification for preferring one type of classifier over another — specifically “simpler” 
classifiers over “complex” ones. Briefly stated, the approach purports to find some irreducible, 
smallest representation of all members of a category (much like a “signal”); 
all variation among the individual patterns is then “noise.” The principle argues that 
by simplifying recognizers appropriately, the signal can be retained while the noise is 
ignored. Because the principle is so often invoked, it is important to understand what 
properly derives from it, what does not, and how it relates to the No Free Lunch 
Theorem. To do so, however, we must first understand the notion of algorithmic 
complexity. 


* The Theorem gets its fanciful name from the following counter-intuitive statement: Assuming 
similarity is based on the number of shared predicates, an ugly duckling A is as similar to beautiful 
swan B as beautiful swan C is to B, given that these items differ from one another. 


Algorithmic complexity 


Algorithmic complexity — also known as Kolmogorov complexity, Kolmogorov-Chaitin 
complexity, descriptional complexity, shortest program length or algorithmic entropy 
— seeks to quantify an inherent complexity of a binary string. (We shall assume both 
classifiers and patterns are described by such strings.) Algorithmic complexity can be 
explained by analogy to communication, the earliest application of information theory 
(App. ??). If the sender and receiver agree upon a specification method L, such as 
an encoding or compression technique, then message x can be transmitted as y, 
denoted L(y) = x or y : L(y) = x. The cost of transmission of x is the length of the 
transmitted message y, that is, |y|. The least such cost is hence the minimum length 
of such a message, denoted $\min_{y} |y| : L(y) = x$; this minimal length is the entropy of x 
under the specification or transmission method L. 


Algorithmic complexity is defined by analogy to entropy, where instead of a spec- 
ification method L, we consider programs running on an abstract computer, i.e., one 
whose functions (memory, processing, etc.) are described operationally and without 
regard to hardware implementation. Consider an abstract computer that takes as a 
program a binary string y and outputs a string x and halts. In such a case we say 
that y is an abstract encoding or description of x. 

A universal description should be independent of the specification (up to some ad- 
ditive constant), so that we can compare the complexities of different binary strings. 
Such a method would provide a measure of the inherent information content, the 
amount of data which must be transmitted in the absence of any other prior knowl- 
edge. The Kolmogorov complexity of a binary string x, denoted K(x), is defined as 
the size of the shortest program y, measured in bits, that without additional data 
computes the string x and halts. Formally, we write 


$$K(x) = \min_{y:\,U(y) = x} |y|, \qquad (7)$$


where U represents an abstract universal Turing machine or Turing computer. For our 
purposes it suffices to state that a Turing machine is “universal” in that it can imple- 
ment any algorithm and compute any computable function. Kolmogorov complexity 
is a measure of the incompressibility of x, and is analogous to minimal sufficient statis- 
tics, the optimally compressed representation of certain properties of a distribution 
(Chap. ??). 

Consider the following examples. Suppose x consists solely of n 1s. This string 
is actually quite “simple.” If we use some fixed number of bits k to specify a general 
program containing a loop for printing a string of 1s, we need merely $\log_2 n$ 
more bits to specify the iteration number n, the condition for halting. Thus the 
Kolmogorov complexity of a string of n 1s is $K(x) = O(\log_2 n)$. Next consider the 
transcendental number $\pi$, whose infinite sequence of seemingly random binary digits, 
$11.00100100001111110110101010001\ldots_2$, actually contains only a few bits of information: 
the size of the shortest program that can produce any arbitrarily large number 
of consecutive digits of $\pi$. Informally we say the algorithmic complexity of $\pi$ is a 
constant; formally we write $K(\pi) = O(1)$, which means $K(\pi)$ does not grow with increasing 
number of desired bits. Another example is a “truly” random binary string, 
which cannot be expressed as a shorter string; its algorithmic complexity is within a 
constant factor of its length. For such a string we write K(x) = O(|x|), which means 
that K(x) grows as fast as the length of x (Problem 13). 
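
Kolmogorov complexity itself cannot be computed in general, but the contrast between a regular and an irregular string can be illustrated with an ordinary compressor. The sketch below uses the length of zlib-compressed output as a stand-in; this is our assumption for illustration and gives only an upper bound on description length under one particular coding method, not K(x). The string length and seed are arbitrary choices.

```python
import random
import zlib

n = 10_000  # length of each test string in bytes (arbitrary choice)

# A highly regular string: n copies of the character "1".
ones = b"1" * n

# An irregular-looking string from a seeded pseudo-random generator.
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(n))

# Compressed sizes in bytes at the highest zlib compression level.
len_ones = len(zlib.compress(ones, 9))
len_noise = len(zlib.compress(noise, 9))

print(len_ones, len_noise)
```

The string of 1s shrinks to a few dozen bytes, while the pseudo-random string does not shrink at all. Note that the pseudo-random string is in fact produced by a short program and a seed, so its true algorithmic complexity is small; the compressor simply fails to find that description, which shows why a practical coding method bounds K(x) only from above.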


9.2.4 Minimum description length principle 


We now turn to a simple, “naive” version of the minimum description length principle 
and its application to pattern recognition. Given that all members of a category 
share some properties, yet differ in others, the recognizer should seek to learn the 
common or essential characteristics while ignoring the accidental or random ones. 
Kolmogorov complexity seeks to provide an objective measure of simplicity, and thus 
the description of the “essential” characteristics. 

Suppose we seek to design a classifier using a training set D. The minimum 
description length (MDL) principle states that we should minimize the sum of the 
model’s algorithmic complexity and the description of the training data with respect 
to that model, i.e., 


$$K(h, D) = K(h) + K(D \text{ using } h). \qquad (8)$$

Thus we seek the model $h^*$ that obeys $h^* = \arg\min_h K(h, D)$ (Problem 14). (Variations 
on the naive minimum description length principle use a weighted sum of the terms 
in Eq. 8.) In practice, determining the algorithmic complexity of a classifier depends 
upon a chosen class of abstract computers, and this means the complexity can be 
specified only up to an additive constant. 

A particularly clear application of the minimum description length principle is in 
the design of decision tree classifiers (Chap. ??). In this case, a model h specifies the 
tree and the decisions at the nodes; thus the algorithmic complexity of the model is 
proportional to the number of nodes. The complexity of the data given the model 
could be expressed in terms of the entropy (in bits) of the data D, the weighted sum 
of the entropies of the data at the leaf nodes. Thus if the tree is pruned based on 
an entropy criterion, there is an implicit global cost criterion that is equivalent to 
minimizing a measure of the general form in Eq. 8 (Computer exercise 1). 
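
A toy calculation may clarify how the two terms of Eq. 8 trade off in tree pruning. This is a sketch under assumptions of our own: the model cost is a fixed, hypothetical number of bits per node, the data cost is the number of samples at each leaf times the leaf entropy in bits, and the three candidate trees and their class counts are invented for illustration.

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy (bits per sample) of a leaf with the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def description_length(leaves, n_nodes, bits_per_node):
    """Two-part cost: bits for the tree plus bits for the labels given the tree."""
    model_bits = bits_per_node * n_nodes
    data_bits = sum(sum(counts) * entropy_bits(counts) for counts in leaves)
    return model_bits + data_bits

BITS_PER_NODE = 8.0  # hypothetical cost of describing one node

# Candidate trees: (class counts at each leaf, total number of nodes).
candidates = {
    "root only":   ([(30, 30)], 1),
    "one split":   ([(29, 1), (1, 29)], 3),
    "fully grown": ([(29, 0), (0, 1), (1, 0), (0, 29)], 7),
}

costs = {name: description_length(leaves, n_nodes, BITS_PER_NODE)
         for name, (leaves, n_nodes) in candidates.items()}
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: {cost:.2f} bits")
print("selected:", best)
```

With these numbers the single-split tree has the smallest total: the unsplit root pays heavily for the data, and the fully grown tree pays heavily for the model. A different value of the per-node cost would shift the balance, which is why the result holds only relative to the chosen description method.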

It can be shown theoretically that classifiers designed with a minimum description 
length principle are guaranteed to converge to the ideal or true model in the limit 
of more and more data. This is surely a very desirable property. However, such 
derivations cannot prove that the principle leads to superior performance in the finite 
data case; to do so would violate the No Free Lunch Theorems. Moreover, in practice 
it is often difficult to compute the minimum description length, since we may not 
be clever enough to find the “best” representation (Problem 17). Assume there is 
some correspondence between a particular classifier and an abstract computer; in 
such a case it may be quite simple to determine the length of the string y necessary 
to create the classifier. But since finding the algorithmic complexity demands we 
find the shortest such string, we must perform a very difficult search through possible 
programs that could generate the classifier. 

The minimum description length principle can be viewed from a Bayesian perspective. 
Using our current terminology, Bayes' formula states 

$$P(h|D) = \frac{P(h)\,P(D|h)}{P(D)} \qquad (9)$$

for discrete hypotheses and data. The optimal hypothesis $h^*$ is the one yielding the 
highest posterior probability, i.e., 

$$h^* = \arg\max_h\,[P(h)\,P(D|h)] = \arg\max_h\,[\log_2 P(h) + \log_2 P(D|h)], \qquad (10)$$


much as we saw in Chap. ??. We note that a string x can be communicated or represented 
at a cost bounded below by $-\log_2 P(x)$, as stated in Shannon’s optimal coding 
theorem. Shannon’s theorem thus provides a link between the minimum description 
length (Eq. 8) and the Bayesian approaches (Eq. 10). The minimum description 
length principle states that simple models (small K(h)) are to be preferred, and thus 
amounts to a bias toward “simplicity.” It is often easier in practice to specify such a 
prior in terms of a description length than it is using functions of distributions (Prob- 
lem 16). We shall revisit the issue of the tradeoff between simplifying the model and 
fitting the data in the bias-variance dilemma in Sec. 9.3. 
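
The correspondence between Eq. 10 and a description length can be checked numerically. The following is a minimal sketch, assuming a small made-up table of priors P(h) and likelihoods P(D|h); the hypothesis names and the numbers are illustrative and do not come from the text. It shows that maximizing P(h)P(D|h) selects the same hypothesis as minimizing the total code length −log2 P(h) − log2 P(D|h), measured in bits.

```python
import math

# Hypothetical priors P(h) and likelihoods P(D|h) for three candidate models.
priors = {"simple": 0.6, "medium": 0.3, "complex": 0.1}
likelihoods = {"simple": 0.02, "medium": 0.10, "complex": 0.12}


def description_length(h):
    """Total code length in bits: cost of the model plus cost of the data given the model."""
    return -math.log2(priors[h]) - math.log2(likelihoods[h])


def map_hypothesis():
    """Hypothesis maximizing P(h) P(D|h), the unnormalized posterior of Eq. 10."""
    return max(priors, key=lambda h: priors[h] * likelihoods[h])


def mdl_hypothesis():
    """Hypothesis with the shortest total description length."""
    return min(priors, key=description_length)
```

Because the logarithm is monotonic, the two selections must agree for any table of positive probabilities; only the units (probability versus bits) differ.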

It is found empirically that classifiers designed using the minimum description 
length principle work well in many problems. As mentioned, the principle is effectively 
a method for biasing priors over models toward “simple” models. The reasons for the 
many empirical successes of the principle are not trivial, as we shall see in Sect. 9.2.5. 
One of the greatest benefits of the principle is that it provides a computationally 
clear approach to balancing model complexity and the fit of the data. In somewhat 
more heuristic methods, such as pruning neural networks, it is difficult to compare 
the algorithmic complexity of the network (e.g., number of units or weights) with the 
entropy of the data with respect to that model. 


9.2.5 Overfitting avoidance and Occam’s razor 


Throughout our discussions of pattern classifiers, we have mentioned the need to avoid 
overfitting by means of regularization, pruning, inclusion of penalty terms, minimizing 
a description length, and so on. The No Free Lunch results throw such techniques 
into question. If there are no problem-independent reasons to prefer one algorithm 
over another, why is overfitting avoidance nearly universally advocated? For a given 
training error, why do we generally advocate simple classifiers with fewer features and 
parameters? 

In fact, techniques for avoiding overfitting or minimizing description length are 
not inherently beneficial; instead, such techniques amount to a preference, or “bias,” 
over the forms or parameters of classifiers. They are only beneficial if they happen 
to address problems for which they work. It is the match of the learning algorithm 
to the problem — not the imposition of overfitting avoidance — that determines the 
empirical success. There are problems for which overfitting avoidance actually leads 
to worse performance. The effects of overfitting avoidance depend upon the choice of 
representation too; if the feature space is mapped to a new, formally equivalent one, 
overfitting avoidance has different effects (Computer exercise ??). 

In light of the negative results from the No Free Lunch theorems, we might probe 
more deeply into the frequent empirical “successes” of the minimum description length 
principle and the more general philosophical principle of Occam’s razor. In its original 
form, Occam’s razor stated merely that “entities” (or explanations) should not be 
multiplied beyond necessity, but it has come to be interpreted in pattern recognition 
as counselling that one should not use classifiers that are more complicated than are 
necessary, where “necessary” is determined by the quality of fit to the training data. 
Given the respective requisite assumptions, the No Free Lunch theorem proves that 
there is no benefit in “simple” classifiers (or “complex” ones, for that matter) — 
simple classifiers claim neither unique nor universal validity. 

The frequent empirical “successes” of Occam’s razor imply that the classes of 
problems addressed so far have certain properties. What might be the reason we 
explore problems that tend to favor simpler classifiers? A reasonable hypothesis is that 
through evolution, we have had strong selection pressure on our pattern recognition 
apparatuses to be computationally simple — require fewer neurons, less time, and 
so forth — and in general such classifiers tend to be “simple.” We are more likely 
to ignore problems for which Occam’s razor does not hold. Analogously, researchers 
naturally develop simple algorithms before more complex ones, as for instance in 
the progression from the Perceptron, to multilayer neural networks, to networks with 
pruning, to networks with topology learning, to hybrid neural net/rule-based methods, 
and so on — each more complex than its predecessor. Each method is found to 
work on some problems, but not ones that are “too complex.” For instance the 
basic Perceptron is inadequate for optical character recognition; a simple three-layer 
neural network is inadequate for speaker-independent speech recognition. Hence our 
design methodology itself imposes a bias toward “simple” classifiers; we generally 
stop searching for a design when the classifier is “good enough.” This principle of 
satisficing — creating an adequate though possibly non-optimal solution — underlies 
much of practical pattern recognition as well as human cognition. 

Another “justification” for Occam’s razor derives from a property we might strongly 
desire or expect in a learning algorithm. If we assume that adding more training data 
does not, on average, degrade the generalization accuracy of a classifier, then a version 
of Occam’s razor can in fact be derived. Note, however, that such a desired property 
amounts to a non-uniform prior over learning algorithms — while this property is 
surely desirable, it is a premise and cannot be “proven.” Finally, the No Free Lunch 
theorem implies that we cannot use training data to create a scheme by which we can 
with some assurance distinguish new problems for which the classifier will generalize 
well from new problems for which the classifier will generalize poorly (Problem 8). 


9.3 Bias and variance 


Given that there is no general best classifier unless the probability over the class of 
problems is restricted, practitioners must be prepared to explore a number of methods 
or models when solving any given classification problem. Below we will define two ways 
to measure the “match” or “alignment” of the learning algorithm to the classification 
problem: the bias and the variance. The bias measures the accuracy or quality of 
the match: high bias implies a poor match. The variance measures the precision or 
specificity of the match: a high variance implies a weak match. Designers can adjust 
the bias and variance of classifiers, but the important bias-variance relation shows 
that the two terms are not independent; in fact, for a given mean-square error, they 
obey a form of “conservation law.” Naturally, though, with prior information or even 
mere luck, classifiers can be created that have a different mean-square error. 


9.3.1 Bias and variance for regression 


Bias and variance are most easily understood in the context of regression or curve 
fitting. Suppose there is a true (but unknown) function F(x) with continuous valued 
output with noise, and we seek to estimate it based on n samples in a set D generated 
by F(x). The regression function estimated is denoted g(x; D) and we are interested 
in the dependence of this approximation on the training set D. Due to random 
variations in data selection, for some data sets of finite size this approximation will 
be excellent while for other data sets of the same size the approximation will be poor. 
The natural measure of the effectiveness of the estimator can be expressed as its 
mean-square deviation from the desired optimal. Thus we average over all training 
sets D of fixed size n and find (Problem 18) 


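\[ \mathcal{E}_D\big[(g(x; D) - F(x))^2\big] = \underbrace{\big(\mathcal{E}_D[g(x; D) - F(x)]\big)^2}_{\text{bias}^2} + \underbrace{\mathcal{E}_D\big[(g(x; D) - \mathcal{E}_D[g(x; D)])^2\big]}_{\text{variance}}. \tag{11} \]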


The first term on the right hand side is the bias (squared) — the difference between 
the expected value and the true (but generally unknown) value — while the second 
term is the variance. Thus a low bias means on average we accurately estimate F 
from D. Further, a low variance means the estimate of F does not change much as the 
training set varies. Even if an estimator is unbiased (i.e., the bias = 0 and its expected 
value is equal to the true value), there can nevertheless be a large mean-square error 
arising from a large variance term. 

Equation 11 shows that the mean-square error can be expressed as the sum of a bias 
and a variance term. The bias-variance dilemma or bias-variance trade-off is a general 
phenomenon: procedures with increased flexibility to adapt to the training data (e.g., 
have more free parameters) tend to have lower bias but higher variance. Different 
classes of regression functions g(x; D) — linear, quadratic, sum of Gaussians, etc. — 
will have different overall errors; nevertheless, Eq. 11 will be obeyed. 
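
The decomposition of Eq. 11 can be verified by simulation. The sketch below is illustrative only: it assumes a single query point with true value F(x) = 2, Gaussian noise, and a hypothetical "shrinkage" estimator g(x; D) that multiplies the average of n noisy observations by a constant. None of these choices come from the text. Averaging over many simulated training sets, the mean-square error equals the squared bias plus the variance, and shrinking trades a larger bias for a smaller variance.

```python
import random


def decompose(estimates, target):
    """Return (mse, bias_squared, variance) for a list of estimates g(x; D) of F(x)."""
    m = len(estimates)
    mean_g = sum(estimates) / m
    mse = sum((g - target) ** 2 for g in estimates) / m
    bias_sq = (mean_g - target) ** 2
    variance = sum((g - mean_g) ** 2 for g in estimates) / m
    return mse, bias_sq, variance


def simulate(shrink, n=6, trials=5000, target=2.0, noise=1.0, seed=0):
    """Draw many training sets of size n and apply the (hypothetical) shrinkage estimator."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        ys = [target + rng.gauss(0.0, noise) for _ in range(n)]  # one training set D
        estimates.append(shrink * sum(ys) / n)                   # g(x; D)
    return decompose(estimates, target)
```

Calling `simulate(1.0)` gives an essentially unbiased estimator, while `simulate(0.8)` gives a biased one with lower variance; in both cases the first returned value equals the sum of the other two up to rounding.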

Suppose for example that the true, target function F(x) is a cubic polynomial of 
one variable, with noise, as illustrated in Fig. 9.4. We seek to estimate this function 
based on a sampled training set D. Column a), at the left, shows a very poor “estimate” 
g(x) — a fixed linear function, independent of the training data. For different training 
sets sampled from F(x) with noise, g(x) is unchanged. The histogram of this mean- 
square error of Eq. 11, shown at the bottom, reveals a spike at a fairly high error; 
because this estimate is so poor, it has a high bias. Further, the variance of the 
constant model g(x) is zero. The model in column b) is also fixed, but happens to be 
a better estimate of F(x). It too has zero variance, but a lower bias than the poor 
model in a). Presumably the designer imposed some prior knowledge about F(x) in 
order to get this improved estimate. 

The model in column c) is a cubic with trainable coefficients; it would learn F(x) 
exactly if D contained infinitely many training points. Notice the fit found for every 
training set is quite good. Thus the bias is low, as shown in the histogram at the 
bottom. The model in d) is linear in x, but its slope and intercept are determined 
from the training data. As such, the model in d) has a lower bias than the models in 
a) and b). 

Figure 9.4: The bias-variance dilemma can be illustrated in the domain of regression. 
Each column represents a different model, each row a different set of n = 6 training 
points, $D_i$, randomly sampled from the true function F(x) with noise. Histograms of 
the mean-square error $E = \mathcal{E}_D[(g(x) - F(x))^2]$ of Eq. 11 are shown at the bottom. 
Column a) shows a very poor model: a linear g(x) whose parameters are held fixed, 
independent of the training data. This model has high bias and zero variance. Column 
b) shows a somewhat better model, though it too is held fixed, independent of the 
training data. It has a lower bias than in a) and the same zero variance. Column 
c) shows a cubic model, where the parameters are trained to best fit the training 
samples in a mean-square error sense. This model has low bias, and a moderate 
variance. Column d) shows a linear model that is adjusted to fit each training set; 
this model has intermediate bias and variance. If these models were instead trained 
with a very large number n → ∞ of points, the bias in c) would approach a small 
value (which depends upon the noise), while the bias in d) would not; the variance of 
all models would approach zero. 

In sum, for a given target function F(x), if a model has many parameters (generally 
low bias), it will fit the data well but yield high variance. Conversely, if the model 
has few parameters (generally high bias), it may not fit the data particularly well, 
but this fit will not change much for different data sets (low variance). The best 
way to get low bias and low variance is to have prior information about the target 
function. We can virtually never get zero bias and zero variance; to do so would mean 
there is only one learning problem to be solved, in which case the answer is already 
known. Furthermore, a large amount of training data will yield improved performance 
so long as the model is sufficiently general to represent the target function. These 
considerations of bias and variance help to clarify the reasons we seek to have as much 
accurate prior information about the form of the solution, and as large a training set 
as feasible; the match of the algorithm to the problem is crucial. 
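
The qualitative behavior described for Fig. 9.4 can be reproduced with a small simulation. The sketch below rests on several assumptions that are not in the text: a hypothetical cubic target F(x) = x³ − 0.5x, a fixed design of n = 6 equally spaced inputs, Gaussian noise of standard deviation 0.1, and evaluation at the single point x = 0.8. It compares a fixed line (as in column a), a trained line (as in column d) and a trained cubic (as in column c), using a hand-written least-squares fit so that only the standard library is needed.

```python
import random

XS = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]  # assumed fixed design, n = 6


def F(x):
    """Hypothetical cubic target; the function used in Fig. 9.4 is not given in the text."""
    return x ** 3 - 0.5 * x


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via the normal equations."""
    k = degree + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde design matrix A.
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def fixed_line(xs, ys):
    """A line chosen in advance that ignores the training data (illustrative values)."""
    return [0.5, -1.0]


def trained(degree):
    """Return a model that fits a polynomial of the given degree to each training set."""
    return lambda xs, ys: polyfit(xs, ys, degree)


def bias_variance(model, x0=0.8, trials=2000, noise=0.1, seed=1):
    """Monte Carlo estimate of (bias^2, variance) of the model's prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        ys = [F(x) + rng.gauss(0.0, noise) for x in XS]  # one noisy training set
        coeffs = model(XS, ys)
        preds.append(sum(c * x0 ** i for i, c in enumerate(coeffs)))
    mean = sum(preds) / trials
    variance = sum((p - mean) ** 2 for p in preds) / trials
    return (mean - F(x0)) ** 2, variance
```

Under these assumptions, `bias_variance(fixed_line)` shows a large bias and zero variance, `bias_variance(trained(1))` an intermediate bias, and `bias_variance(trained(3))` a bias near zero with a somewhat larger variance than the trained line.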


9.3.2 Bias and variance for classification 


While the bias-variance decomposition and dilemma are simplest to understand in 
the case of regression, we are most interested in their relevance to classification; here 
there are a few complications. In a two-category classification problem we let the 
target (discriminant) function have value 0 or +1, i.e., 


\[ F(x) = \Pr[y = 1|x] = 1 - \Pr[y = 0|x]. \tag{12} \]


On first consideration, the mean-square error we saw for regression (Eq. 11) does not 
appear to be the proper one for classification. After all, even if the mean-square error 
fit is poor, we can have accurate classification, possibly even the lowest (Bayes) error. 
This is because the decision rule under a zero-one loss selects the higher posterior 
$P(\omega_i|x)$, regardless of the amount by which it is higher. Nevertheless by considering the 
expected value of y, we can recast classification into the framework of regression we 
saw before. To do so, we consider a discriminant function 


\[ y = F(x) + \epsilon, \tag{13} \]


where $\epsilon$ is a zero-mean random variable, for simplicity here assumed to have a centered 
binomial distribution with variance $\mathrm{Var}[\epsilon|x] = F(x)(1 - F(x))$. The target function 
can thus be expressed as 


\[ F(x) = \mathcal{E}[y|x], \tag{14} \]


and now the goal is to find an estimate g(x; D) which minimizes a mean-square error, 
such as in Eq. 11: 


\[ \mathcal{E}_D\big[(g(x; D) - y)^2\big]. \tag{15} \]


In this way the regression methods of Sect. 9.3.1 can yield an estimate g(x; D) used 
for classification. 

For simplicity we assume equal priors, $P(\omega_1) = P(\omega_2) = 0.5$, and thus the Bayes 
discriminant $y_B$ has threshold 1/2 and the Bayes decision boundary is the set of points 
for which F(x) = 1/2. For a given training set D, if the classification error rate 
$\Pr[g(x; D) \ne y]$ averaged over predictions at x agrees with the Bayes discriminant, 


\[ \Pr[g(x; D) \ne y] = \Pr[y_B(x) \ne y] = \min[F(x), 1 - F(x)], \tag{16} \]


then indeed we have the lowest error. If not, then the prediction yields an increased 
error 


\[ \Pr[g(x; D) \ne y] = \max[F(x), 1 - F(x)] = |2F(x) - 1| + \Pr[y_B(x) \ne y]. \tag{17} \]


We average over all data sets of size n and find 


\[ \Pr[g(x; D) \ne y] = |2F(x) - 1| \, \Pr[g(x; D) \ne y_B] + \Pr[y_B \ne y]. \tag{18} \]


Equation 18 shows that classification error rate is linearly proportional to $\Pr[g(x; D) \ne y_B]$, 
which can be considered a boundary error in that it represents the mis-estimation of 
the optimal (Bayes) boundary (Problem 19). 

Because of random variations in training sets, the boundary error will depend 
upon p(g(x; D)), the probability density of obtaining a particular estimate of the 
discriminant given D. This error is merely the area of the tail of p(g(x; D)) on 
the opposite side of the Bayes discriminant value 1/2, much as we saw in Chap. ??, 
Fig. ??: 


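\[ \Pr[g(x; D) \ne y_B] = \begin{cases} \int_{1/2}^{\infty} p(g(x; D))\,dg & \text{if } F(x) < 1/2 \\ \int_{-\infty}^{1/2} p(g(x; D))\,dg & \text{if } F(x) \ge 1/2. \end{cases} \tag{19} \]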


If we make the natural assumption that p(g(x; D)) is a Gaussian, we find (Problem 20) 


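\[ \Pr[g(x; D) \ne y_B] = \Phi\!\left[ \mathrm{sgn}[F(x) - 1/2] \, \frac{\mathcal{E}_D[g(x; D)] - 1/2}{\sqrt{\mathrm{Var}[g(x; D)]}} \right] = \Phi\Big[ \underbrace{\mathrm{sgn}[F(x) - 1/2]\,\big[\mathcal{E}_D[g(x; D)] - 1/2\big]}_{\text{boundary bias}} \; \underbrace{\mathrm{Var}[g(x; D)]^{-1/2}}_{\text{variance}} \Big], \tag{20} \]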


where 


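\[ \Phi[t] = \frac{1}{\sqrt{2\pi}} \int_t^{\infty} e^{-u^2/2}\,du = \frac{1}{2}\left[1 - \mathrm{erf}\!\left(t/\sqrt{2}\right)\right] \tag{21} \]


and erf[·] is the familiar error function (App. ??). 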

We have expressed this boundary error in terms of a boundary bias, in analogy 
with the simple bias-variance relation in regression (Eq. 11). Equation 20 shows that 
the effect of the variance term on the boundary error is highly nonlinear and depends 
on the value of the boundary bias. Further, when the variance is small, this effect 
is particularly sensitive to the sign of the bias. In regression the estimation error 
is additive in bias² and variance, whereas for classification there is a nonlinear and 
multiplicative interaction. In classification the sign of the boundary bias affects the 
role of variance in the error. For this reason low variance is generally important for 
accurate classification while low boundary bias need not be. Or said another way, in 
classification, variance generally dominates bias. In practical terms, this implies we 
need not be particularly concerned if our estimation is biased, so long as the variance 
is kept low. Numerous specific methods of classifier adjustment — pruning neural 
networks or decision trees, varying the number of free parameters, etc. — affect the 
bias and variance of a classifier; in Sect. 9.5 we shall discuss some methods applicable 
to a broad range of classification methods. Much as we saw in the bias-variance 
dilemma for regression, classification procedures with increased flexibility to adapt to 
the training data (e.g., have more free parameters) tend to have lower bias but higher 
variance. 
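
The interaction between boundary bias and variance in Eqs. 18 and 20 can be explored numerically. The following is a minimal sketch under the Gaussian assumption on p(g(x; D)) stated above. Here Φ[t] is taken to be the upper-tail area of a standard normal density, which in terms of the standard error function is 0.5·erfc(t/√2). The function names and the numerical values in the usage note are illustrative assumptions.

```python
import math


def upper_tail(t):
    """Phi[t]: area of a standard normal density above t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def boundary_error(F, mean_g, var_g):
    """Eq. 20: Pr[g(x; D) != y_B], assuming g(x; D) is Gaussian over training sets."""
    sign = 1.0 if F >= 0.5 else -1.0
    return upper_tail(sign * (mean_g - 0.5) / math.sqrt(var_g))


def classification_error(F, mean_g, var_g):
    """Eq. 18: |2F - 1| times the boundary error, plus the Bayes error min[F, 1 - F]."""
    return abs(2.0 * F - 1.0) * boundary_error(F, mean_g, var_g) + min(F, 1.0 - F)
```

For example, with F(x) = 0.9, an estimate whose mean is only 0.6 (heavily biased, but on the correct side of 1/2) with variance 0.001 gives an error close to the Bayes error of 0.1, whereas an unbiased estimate with mean 0.9 and variance 0.25 gives a noticeably larger error. An estimate whose mean lies on the wrong side of 1/2 is hurt, rather than helped, by low variance.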
As an illustration of boundary bias and variance in classifiers, consider a simple 
two-class problem in which samples are drawn from two-dimensional Gaussian distributions, 
each parameterized by a mean vector and a covariance matrix, $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, for i = 1, 2. Here the 
true distributions have diagonal covariances, as shown at the top of Fig. 9.5. We have 
just a few samples from each category and estimate the parameters in three different 
classes of models by maximum likelihood. Column a) at the left shows the most gen- 
eral Gaussian classifiers; each component distribution can have arbitrary covariance 
matrix. Column b) at the middle shows classifiers where each component Gaussian 
is constrained to have a diagonal covariance. Column c) at the right shows the most 
restrictive model: the covariances are equal to the identity matrix, yielding circular 
Gaussian distributions. Thus the left column corresponds to very low bias, and the 
right column to high bias. 

Each row in Fig. 9.5 represents a different training set, randomly selected from 
the true distribution (shown at the top), and the resulting classifiers. Notice that 
most feature points in the high bias cases retain their classification, regardless of the 
particular training set (i.e., such models have low variance), whereas the classification 
of a much larger range of points varies in the low bias case (i.e., there is high variance). 
While in general a lower bias comes at the expense of higher variance, the relationship 
is nonlinear and multiplicative. 

At the bottom of the figure, three density plots show how the location of the 
decision boundary varies across many different training sets. The left-most density 
plot shows a very broad distribution (high variance). The right-most plot shows a 
narrow, peaked distribution (low variance). To visualize the bias, imagine taking the 
spatial average of the decision boundaries obtained by running the learning algorithm 
on all possible data sets. The average of such boundaries for the left-most algorithm 
will be equal to the true decision boundary — this algorithm has no bias. The right- 
most average will be a vertical line, and hence there will be higher error — this 
algorithm has the highest bias of the three. Histograms of the generalization error 
are shown along the bottom. 

For a given bias, the variance will decrease as n is increased. Naturally, if we 
had trained using a very large training set (n → ∞), all error histograms would become 
narrower and move to lower values of E. If a model is rich enough to express the 
optimal decision boundary, its error histogram for the large n case will approach a 
delta function at $E = E_B$, the Bayes error. 

As mentioned, to achieve the desired low generalization error it is more important 
to have low variance than to have low bias. The only way to get the ideal of zero bias 
and zero variance is to know the true model ahead of time (or be astoundingly lucky 
and guess it), in which case no learning was needed anyway. Bias and variance can be 
lowered with large training size n and accurate prior knowledge of the form of F(x). 
Further, as n grows, more parameters must be added to the model, g, so the data 
can be fit (reducing bias). For best classification based on a finite training set, it is 
desirable to match the form of the model to that of the (unknown) true distributions; 
this usually requires prior knowledge. 


[Figure 9.5: the true decision boundary (top); learned boundaries for three training sets in columns a), b) and c), ordered from low to high bias and from high to low variance (middle); boundary distributions and error histograms (bottom).] 

Figure 9.5: The (boundary) bias-variance tradeoff in classification can be illustrated 
with a two-dimensional Gaussian problem. The figure at the top shows the (true) 
decision boundary of the Bayes classifier. The nine figures in the middle show nine 
different learned decision boundaries. Each row corresponds to a different training set 
of n = 8 points selected randomly from the true distributions and labeled according 
to the true decision boundary. Column a) shows the decision boundaries learned 
by fitting a Gaussian model with fully general covariance matrices by maximum likelihood. 
The learned boundaries differ significantly from one data set to the next; 
this learning algorithm has high variance. Column b) shows the decision boundaries 
resulting from fitting a Gaussian model with diagonal covariances; in this case the 
decision boundaries vary less from one row to another. This learning algorithm has a 
lower variance than the one at the left. Finally, column c) at the right shows decision 
boundaries learned by fitting a Gaussian model with unit covariances (i.e., a linear 
model); notice that the decision boundaries are nearly identical from one data set to 
the next. This algorithm has low variance. 


9.4 *Resampling for estimating statistics 


When we apply some learning algorithm to a new pattern recognition problem with 
unknown distribution, how can we determine the bias and variance? Figures 9.4 & 
9.5 suggest a method using multiple samples, an inspiration for formal “resampling” 
methods, which we now discuss. Later we shall turn to our ultimate goal: using 
resampling and related techniques to improve classification (Sect. 9.5). 


9.4.1 Jackknife 


We begin with an example of how resampling can be used to yield a more informative 
estimate of a general statistic. Suppose we have a set D of n data points $x_i$ (i = 
1, ..., n), sampled from a one-dimensional distribution. The familiar estimate of the 
mean is, of course, 


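\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{22} \]


Likewise the estimate of the accuracy of the mean is the standard deviation, given by 


\[ \hat{\sigma}^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - \hat{\mu})^2. \tag{23} \]
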
Suppose we were instead interested in the median, the point for which half of the 
distribution is higher, half lower. Although we could determine the median explic- 
itly, there does not seem to be a straightforward way to generalize Eq. 23 to give a 
measure of the error of our estimate of the median. The same difficulty applies to 
estimating the mode (the most frequently represented point in a data set), the 25th 
percentile, or any of a large number of statistics other than the mean. The jackknife* 
and bootstrap (Sect. 9.4.2) are two of the most popular and theoretically grounded 
resampling techniques for extending the above approach (based on Eqs. 22 & 23) to 
arbitrary statistics, of which the mean is just one instance. 
In resampling theory, we frequently use statistics in which a data point is elimi- 
nated from the data; we denote this by means of a special subscript. For instance, 
the leave-one-out mean is 


\[ \mu_{(i)} = \frac{1}{n-1} \sum_{j \ne i} x_j = \frac{n\hat{\mu} - x_i}{n-1}, \tag{24} \]

i.e., the sample average of the data set if the ith point is deleted. Next we define the 

jackknife estimate of the mean to be 


\[ \mu_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \mu_{(i)}, \tag{25} \]


that is, the mean of the leave-one-out means. It is simple to prove that the traditional 
estimate of the mean and the jackknife estimate of the mean are the same, i.e., $\hat{\mu} = \mu_{(\cdot)}$ 
(Problem 23). Likewise, the jackknife estimate of the variance of the estimate obeys 


\[ \mathrm{Var}[\hat{\mu}] = \frac{n-1}{n} \sum_{i=1}^{n} (\mu_{(i)} - \mu_{(\cdot)})^2, \tag{26} \]


and, applied to the mean, is equivalent to the traditional variance of Eq. 23 (Problem 
26). 
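
The two equivalences just stated are easy to confirm on a small data set. The sketch below is illustrative; the function names are assumptions. It computes the leave-one-out means, their average, and the jackknife variance, and compares them with the ordinary sample mean and with the squared standard error of the mean, Σ(x_i − mean)² / (n(n − 1)).

```python
def leave_one_out_means(xs):
    """Mean of the data with the i-th point deleted, for each i (Eq. 24)."""
    n, total = len(xs), sum(xs)
    return [(total - x) / (n - 1) for x in xs]


def jackknife_mean(xs):
    """Mean of the leave-one-out means (Eq. 25)."""
    loo = leave_one_out_means(xs)
    return sum(loo) / len(loo)


def jackknife_variance_of_mean(xs):
    """(n - 1)/n times the summed squared deviations of the leave-one-out means (Eq. 26)."""
    n = len(xs)
    loo = leave_one_out_means(xs)
    center = sum(loo) / n
    return (n - 1) / n * sum((m - center) ** 2 for m in loo)


def traditional_variance_of_mean(xs):
    """Squared standard error of the mean: sum (x_i - mean)^2 / (n (n - 1))."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n * (n - 1))
```

For any list of two or more numbers, `jackknife_mean` agrees with the plain average and the two variance functions agree with each other, up to floating-point rounding.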


* The jackknife method, which also goes by the name of “leave one out,” was due to Maurice 
Quenouille. The playful name was chosen by John W. Tukey to capture the impression that the 
method was handy and useful in lots of ways. 
The benefit of expressing the variance in the form of Eq. 26 is that it can be 
generalized to any other estimator $\hat{\theta}$, such as the median or 25th percentile or mode. 
To do so we need to compute the statistic with one data point “left out.” Thus 
we let 


\[ \hat{\theta}_{(i)} = \hat{\theta}(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \tag{27} \]


take the place of $\mu_{(i)}$, and let $\hat{\theta}_{(\cdot)}$ take the place of $\mu_{(\cdot)}$ in Eqs. 25 & 26 above. 


Jackknife bias estimate 


The notion of bias is more general than that described in Sect. 9.3; in fact it can be 
applied to the estimation of any statistic. The bias of an estimator $\hat{\theta}$ is the difference 
between its expected value and the true value, i.e., 

\[ \text{bias} = \mathcal{E}[\hat{\theta}] - \theta. \tag{28} \]

The jackknife method can be used to estimate such a bias. The procedure is first to 
sequentially delete points $x_i$ one at a time from D and compute the estimate $\hat{\theta}_{(\cdot)}$. 
Then the jackknife estimate of the bias is (Problem 21) 

\[ \text{bias}_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}). \tag{29} \]


We rearrange terms and thus see that the jackknife estimate of $\theta$ itself is 


\[ \tilde{\theta} = \hat{\theta} - \text{bias}_{jack} = n\hat{\theta} - (n-1)\hat{\theta}_{(\cdot)}. \tag{30} \]


The benefit of using Eq. 30 is that it is a quadratic function, unbiased for estimating 
the true bias (Problem 25). 
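
A standard illustration of Eqs. 29 and 30, not taken from the text, is the "plug-in" variance estimate that divides by n rather than n − 1. It is biased downward, and its jackknife-corrected version coincides with the usual estimate that divides by n − 1. The sketch below checks this on a tiny data set; the function names are illustrative assumptions.

```python
def plug_in_variance(xs):
    """Biased variance estimate that divides by n rather than n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def jackknife_bias(statistic, xs):
    """(n - 1) * (mean of leave-one-out estimates - full-sample estimate), as in Eq. 29."""
    n = len(xs)
    loo = [statistic(xs[:i] + xs[i + 1:]) for i in range(n)]
    return (n - 1) * (sum(loo) / n - statistic(xs))


def jackknife_corrected(statistic, xs):
    """Full-sample estimate minus the jackknife estimate of its bias, as in Eq. 30."""
    return statistic(xs) - jackknife_bias(statistic, xs)
```

For the data [0, 0, 3] the plug-in variance is 2, the jackknife bias estimate is −1, and the corrected estimate is 3, which equals the sum of squared deviations divided by n − 1 = 2.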


Jackknife variance estimate 


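Now we seek the jackknife estimate of the variance of an arbitrary statistic $\theta$. First, 
recall that the traditional variance is defined as: 


\[ \mathrm{Var}[\hat{\theta}] = \mathcal{E}\big[(\hat{\theta}(x_1, x_2, \ldots, x_n) - \mathcal{E}[\hat{\theta}])^2\big]. \tag{31} \]

The jackknife estimate of the variance, defined by analogy to Eq. 26, is: 


\[ \mathrm{Var}_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} [\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}]^2, \tag{32} \]


where as before $\hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}$. 

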
Example 2: Jackknife estimate of bias and variance of the mode 


Consider an elementary example where we are interested in the mode of the following 
n = 6 points: D = {0, 10, 10, 10, 20, 20}. It is clear from the histogram that 
the most frequently represented point is $\hat{\theta} = 10$. The jackknife estimate of the mode 
is 


\[ \hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)} = \frac{1}{6}[10 + 15 + 15 + 15 + 10 + 10] = 12.5, \]


where for i = 2, 3, 4 we used the fact that the mode of a distribution having two 
equal peaks is the point midway between those peaks. The fact that $\hat{\theta}_{(\cdot)} > \hat{\theta}$ reveals 
immediately that the jackknife estimate takes into account more of the full (skewed) 
distribution than does the standard mode calculation. 

A histogram of n = 6 points whose mode is $\hat{\theta} = 10$ and jackknife estimate of the mode 
is $\hat{\theta}_{(\cdot)} = 12.5$. The square root of the jackknife estimate of the variance is a natural 
measure of the range of probable values of the mode. This range is indicated by the 
horizontal red bar. 


The jackknife estimate of the bias of the estimate of the mode is given by Eq. 29: 


\[ \text{bias}_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}) = 5(12.5 - 10) = 12.5. \]


Likewise, the jackknife estimate of the variance is given by Eq. 32: 


\[ \mathrm{Var}_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 = \frac{5}{6}\big[(10 - 12.5)^2 + 3(15 - 12.5)^2 + 2(10 - 12.5)^2\big] = 31.25. \]


The square root of this variance, $\sqrt{31.25} \approx 5.6$, serves as an effective standard 
deviation. A red bar of twice this width, shown below the histogram, reveals that the 
traditional mode lies within this tolerance to the jackknife estimate of the mode. 
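
The arithmetic of this example can be reproduced with a short script. The sketch below assumes, as the example does, that tied peaks are replaced by their midpoint; here that is implemented by averaging all tied peak values, which reduces to the midpoint rule when there are two peaks. The function names are illustrative.

```python
from collections import Counter


def mode(xs):
    """Most frequent value; tied peaks are averaged (the midpoint for two peaks)."""
    counts = Counter(xs)
    top = max(counts.values())
    peaks = [v for v, c in counts.items() if c == top]
    return sum(peaks) / len(peaks)


def jackknife(statistic, xs):
    """Return (theta_hat, theta_dot, bias_jack, var_jack) following Eqs. 29 and 32."""
    n = len(xs)
    theta_hat = statistic(xs)
    loo = [statistic(xs[:i] + xs[i + 1:]) for i in range(n)]  # leave-one-out estimates
    theta_dot = sum(loo) / n
    bias = (n - 1) * (theta_dot - theta_hat)
    var = (n - 1) * sum((t - theta_dot) ** 2 for t in loo) / n
    return theta_hat, theta_dot, bias, var


D = [0, 10, 10, 10, 20, 20]
theta_hat, theta_dot, bias, var = jackknife(mode, D)
print(theta_hat, theta_dot, bias, var)  # 10.0 12.5 12.5 31.25
```

The same `jackknife` function can be applied to other statistics, such as a median, by passing a different function in place of `mode`.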


The jackknife resampling technique often gives us a more satisfactory estimate of 
a statistic such as the mode than do traditional methods though it is more computa- 
tionally complex (Problem 27). 


9.4.2 Bootstrap 


A “bootstrap” data set is one created by randomly selecting n points from the training 
set D, with replacement. (Since D itself contains n points, there is nearly always 
duplication of individual points in a bootstrap data set.) In bootstrap estimation,* 
this selection process is independently repeated B times to yield B bootstrap data 
sets, which are treated as independent sets. The bootstrap estimate of a statistic $\theta$, 
denoted $\hat{\theta}^{*(\cdot)}$, is merely the mean of the B estimates on the individual bootstrap data 
sets: 


\[ \hat{\theta}^{*(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)}, \tag{33} \]


where $\hat{\theta}^{*(b)}$ is the estimate on bootstrap sample b. 


Bootstrap bias estimate 


The bootstrap estimate of the bias is (Problem ??) 


\[ \text{bias}_{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)} - \hat{\theta} = \hat{\theta}^{*(\cdot)} - \hat{\theta}. \tag{34} \]


Computer exercise 45 shows how the bootstrap can be applied to statistics that resist 
computational analysis, such as the “trimmed mean,” in which the mean is calculated 
for a distribution in which some percentage (e.g., 5%) of the high and the low points 
in a distribution have been eliminated. 


Bootstrap variance estimate 


The bootstrap estimate of the variance is 


\[ \mathrm{Var}_{boot}[\theta] = \frac{1}{B} \sum_{b=1}^{B} \big[\hat{\theta}^{*(b)} - \hat{\theta}^{*(\cdot)}\big]^2. \tag{35} \]


If the statistic $\theta$ is the mean, then in the limit of B → ∞, the bootstrap estimate of 
the variance is the traditional variance of the mean (Problem 22). Generally speaking, 
the larger the number B of bootstrap samples, the more satisfactory is the estimate 
of a statistic and its variance. One of the benefits of bootstrap estimation is that B 
can be adjusted to the computational resources; if powerful computers are available 
for a long time, then B can be chosen large. In contrast, a jackknife estimate requires 
exactly n repetitions: fewer repetitions give a poorer estimate that depends upon 
the random points chosen; more repetitions merely duplicate information already 
provided by some of the first n leave-one-out calculations. 
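
The bootstrap estimates of a statistic, its bias and its variance can be sketched in a few lines. The code below is a minimal illustration, assuming the data fit in a Python list and that a seeded pseudo-random generator is acceptable; the function name, the default B = 1000 and the seed are arbitrary choices, not values from the text.

```python
import random
import statistics


def bootstrap(statistic, xs, B=1000, seed=0):
    """Return (theta_star, bias_boot, var_boot) in the sense of Eqs. 33-35."""
    rng = random.Random(seed)
    n = len(xs)
    # Each bootstrap data set draws n points from xs with replacement.
    estimates = [statistic([rng.choice(xs) for _ in range(n)]) for _ in range(B)]
    theta_star = sum(estimates) / B                      # mean of the B estimates
    bias = theta_star - statistic(xs)                    # bootstrap bias estimate
    var = sum((e - theta_star) ** 2 for e in estimates) / B  # bootstrap variance estimate
    return theta_star, bias, var
```

A call such as `bootstrap(statistics.median, data, B=2000)` returns estimates for the median, a statistic for which no simple closed-form error formula is available; increasing B reduces the Monte Carlo noise in the result at the cost of more computation.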


* “Bootstrap” comes from Rudolf Erich Raspe’s wonderful stories “The adventures of Baron Munchhausen,” 
in which the hero could pull himself up onto his horse by lifting his own bootstraps. A 
different but more common usage of the term applies to starting a computer, which must first run 
a program before it can run other programs. 


9.5 Resampling for classifier design 


The previous section addressed the use of resampling in estimating statistics, including 
the accuracy of an existing classifier, but only indirectly referred to the design of 
classifiers themselves. We now turn to a number of general resampling methods that 
have proven effective when used in conjunction with any of a wide range of techniques 
for training classifiers. These are related to methods for estimating and comparing 
classifier models that we will discuss in Sect. 9.6. 


9.5.1 Bagging 


The generic term arcing — adaptive reweighting and combining — refers to reusing 
or selecting data in order to improve classification. In Sect. 9.5.2 we shall consider 
the most popular arcing procedure, AdaBoost, but first we discuss briefly one of the 
simplest. Bagging — a name derived from “bootstrap aggregation” — uses multiple 
versions of a training set, each created by drawing n’ < n samples from D with 
replacement. Each of these bootstrap data sets is used to train a different component 
classifier and the final classification decision is based on the vote of each component 
classifier.* Traditionally the component classifiers are of the same general form — 
i.e., all hidden Markov models, or all neural networks, or all decision trees — merely 
the final parameter values differ among them due to their different sets of training 
patterns. 
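
The bagging procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular library: the component classifier is a hypothetical one-dimensional threshold rule ("stump") with labels 0 and 1, the number of components defaults to 25, and the bootstrap sample size n′ defaults to n when not given (the text takes n′ < n).

```python
import random


def train_stump(xs, ys):
    """Fit a one-dimensional threshold classifier minimizing training error (labels 0/1)."""
    best = None
    for t in [float("-inf")] + sorted(set(xs)):
        for above in (0, 1):  # label predicted when x > t
            errors = sum((above if x > t else 1 - above) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above


def bagging(xs, ys, n_classifiers=25, n_prime=None, seed=0):
    """Train component stumps on bootstrap data sets; return a majority-vote classifier."""
    rng = random.Random(seed)
    n = len(xs)
    n_prime = n_prime or n
    components = []
    for _ in range(n_classifiers):
        idx = [rng.randrange(n) for _ in range(n_prime)]  # draw with replacement
        components.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def vote(x):
        ones = sum(c(x) for c in components)
        return 1 if 2 * ones > len(components) else 0

    return vote
```

Using an odd number of component classifiers avoids tied votes in the two-category case. Any other training routine that returns a callable classifier could be substituted for `train_stump`.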

A classifier/learning algorithm combination is informally called unstable if “small” 
changes in the training data lead to significantly different classifiers and relatively 
“large” changes in accuracy. As we saw in Chap. ??, decision tree classifiers trained 
by a greedy algorithm can be unstable — a slight change in the position of a single 
training point can lead to a radically different tree. In general, bagging improves 
recognition for unstable classifiers since it effectively averages over such discontinuities. 
There are no convincing theoretical derivations or simulation studies showing that 
bagging will help all stable classifiers, however. 

Bagging is our first encounter with multiclassifier systems, where a final overall 
classifier is based on the outputs of a number of component classifiers. The global de- 
cision rule in bagging — a simple vote among the component classifiers — is the most 
elementary method of pooling or integrating the outputs of the component classifiers. 
We shall consider multiclassifier systems again in Sect. 9.7, with particular attention 
to forming a single decision rule from the outputs of the component classifiers. 


9.5.2 Boosting 


The goal of boosting is to improve the accuracy of any given learning algorithm. In 
boosting we first create a classifier with accuracy on the training set greater than 
average, and then add new component classifiers to form an ensemble whose joint 
decision rule has arbitrarily high accuracy on the training set. In such a case we say 
the classification performance has been “boosted.” In overview, the technique trains 
successive component classifiers with a subset of the training data that is “most 
informative” given the current set of component classifiers. Classification of a test 
point x is based on the outputs of the component classifiers, as we shall see. 

For definiteness, consider creating three component classifiers for a two-category
problem through boosting. First we randomly select a set of n1 < n patterns from
the full training set D (without replacement); call this set D1. Then we train the first
classifier, C1, with D1. Classifier C1 need only be a weak learner, i.e., have accuracy
only slightly better than chance. (Of course, this is the minimum requirement; a
weak learner could have high accuracy on the training set. In that case the benefit
of boosting will be small.) Now we seek a second training set, D2, that is the “most
informative” given component classifier C1. Specifically, half of the patterns in D2
should be correctly classified by C1, half incorrectly classified by C1 (Problem 29).
Such an informative set D2 is created as follows: we flip a fair coin. If the coin is
heads, we select remaining samples from D and present them, one by one, to C1 until
C1 misclassifies a pattern. We add this misclassified pattern to D2. Next we flip the
coin again. If heads, we continue through D to find another pattern misclassified by
C1 and add it to D2 as just described; if tails we find a pattern which C1 classifies
correctly and add that pattern to D2 instead. We continue until no more patterns can
be added in this manner. Thus half of the patterns in D2 are correctly classified by C1,
half are not. As such D2 provides information complementary to that represented in C1.
Now we train a second component classifier C2 with D2.


* In Sect. 9.7 we shall come across other names for component classifiers. For the present purposes
we simply note that these are not classifiers of component features, but are instead members in an
ensemble of classifiers whose outputs are pooled so as to implement a single classification rule.

Next we seek a third data set, D3, which is not well classified by the combined
system C1 and C2. We randomly select a training pattern from those remaining
in D, and classify that pattern with C1 and with C2. If C1 and C2 disagree, we
add this pattern to the third training set D3; otherwise we ignore the pattern. We
continue adding informative patterns to D3 in this way; thus D3 contains those not
well represented by the combined decisions of C1 and C2. Finally, we train the last
component classifier, C3, with the patterns in D3.

Now consider the use of the ensemble of three trained component classifiers for
classifying a test pattern x. Classification is based on the votes of the component
classifiers. Specifically, if C1 and C2 agree on the category label of x, we use that
label; if they disagree, then we use the label given by C3 (Fig. 9.6).
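
The three-component procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: the weak learner `train_stump` is a hypothetical decision stump, the coin-flip loop is assumed to stop as soon as the list called for by the coin is exhausted, and falling back on C1 when D2 or D3 turns out to be empty is our own convention.

```python
import random

def train_stump(data):
    """Hypothetical weak learner: best single-feature threshold.
    data is a list of (x, y) pairs, x a tuple of floats, y in {-1, +1}."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            for sign in (+1, -1):
                errors = sum(1 for xi, yi in data
                             if (sign if xi[f] > x[f] else -sign) != yi)
                if best is None or errors < best[0]:
                    best = (errors, f, x[f], sign)
    _, f, thr, sign = best
    return lambda x: sign if x[f] > thr else -sign

def build_d2(c1, pool, rng):
    """Coin-flip selection: heads takes the next pattern C1 misclassifies,
    tails the next pattern C1 classifies correctly."""
    wrong = [p for p in pool if c1(p[0]) != p[1]]
    right = [p for p in pool if c1(p[0]) == p[1]]
    d2 = []
    while True:
        source = wrong if rng.random() < 0.5 else right
        if not source:          # assumed stopping rule
            break
        d2.append(source.pop(0))
    return d2, wrong + right    # D2 and the patterns still unused

def boost3(data, n1, rng=None):
    rng = rng or random.Random(0)
    pool = list(data)
    rng.shuffle(pool)
    d1, pool = pool[:n1], pool[n1:]          # D1: n1 patterns, no replacement
    c1 = train_stump(d1)
    d2, pool = build_d2(c1, pool, rng)
    c2 = train_stump(d2) if d2 else c1       # fallback is an assumption
    d3 = [p for p in pool if c1(p[0]) != c2(p[0])]   # C1 and C2 disagree
    c3 = train_stump(d3) if d3 else c1       # fallback is an assumption

    def classify(x):
        a, b = c1(x), c2(x)
        return a if a == b else c3(x)        # C3 breaks disagreements
    return classify, (len(d1), len(d2), len(d3))
```

The returned triple of set sizes makes it easy to see how evenly a given choice of n1 partitions the training set.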

We skipped over a practical detail in the boosting algorithm: how to choose the
number of patterns n1 to train the first component classifier. We would like the
final system to be trained with all patterns in D of course; moreover, because the
final decision is a simple vote among the component classifiers, we would like to have
a roughly equal number of patterns in each (i.e., n1 ≈ n2 ≈ n3 ≈ n/3). A reasonable
first guess is to set n1 ≈ n/3 and create the three component classifiers. If the
classification problem is very simple, however, component classifier C1 will explain
most of the data and thus n2 (and n3) will be much less than n1, and not all of the
patterns in the training set D will be used. Conversely, if the problem is extremely
difficult, then C1 will explain but little of the data, and nearly all the patterns will be
informative with respect to C1; thus n2 will be unacceptably large. Thus in practice
we may need to run the overall boosting procedure a few times, adjusting n1 in order
to use the full training set and, if possible, get roughly equal partitions of the training
set. A number of simple heuristics can be used to improve the partitioning of the
training set as well (Computer exercise ??).

The above boosting procedure can be applied recursively to the component clas- 
sifiers themselves, giving a 9-component or even 27-component full classifier. In this 
way, a very low training error can be achieved, even a vanishing training error if the 
problem is separable. 


AdaBoost 


There are a number of variations on basic boosting. The most popular, AdaBoost
(from “adaptive” boosting), allows the designer to continue adding weak learners
until some desired low training error has been achieved. In AdaBoost each training
pattern receives a weight which determines its probability of being selected for a
training set for an individual component classifier. If a training pattern is accurately
classified, then its chance of being used again in a subsequent component classifier
is reduced; conversely, if the pattern is not accurately classified, then its chance of
being used again is raised. In this way, AdaBoost “focuses in” on the informative or
“difficult” patterns. Specifically, we initialize these weights across the training set
to be uniform. On each iteration k, we draw a training set at random according to
these weights, and train component classifier Ck on the patterns selected. Next we
increase weights of training patterns misclassified by Ck and decrease weights of the
patterns correctly classified by Ck. Patterns chosen according to this new distribution
are used to train the next classifier, Ck+1, and the process is iterated.


Figure 9.6: A two-dimensional two-category classification task is shown at the top.
The middle row shows three component (linear) classifiers Ck trained by the LMS
algorithm (Chap. ??), where their training patterns were chosen through the basic
boosting procedure. The final classification is given by the voting of the three
component classifiers, and yields a nonlinear decision boundary, as shown at the bottom.
Given that the component classifiers are weak learners (i.e., each can learn a training
set better than chance), then the ensemble classifier will have a lower training error
on the full training set D than does any single component classifier.

We let the patterns and their labels in D be denoted x^i and y_i, respectively, and let
W_k(i) be the kth (discrete) distribution over all these training samples. The AdaBoost
procedure is then:


Algorithm 1 (AdaBoost) 


1 begin initialize D = {x^1, y_1, x^2, y_2, ..., x^n, y_n}, kmax, W_1(i) = 1/n, i = 1,...,n
2     k ← 0
3     do k ← k + 1
4         train weak learner Ck using D sampled according to distribution W_k(i)
5         E_k ← training error of Ck measured on D using W_k(i)
6         α_k ← (1/2) ln[(1 − E_k)/E_k]
7         W_{k+1}(i) ← (W_k(i)/Z_k) × e^{−α_k}  if h_k(x^i) = y_i (correctly classified)
          W_{k+1}(i) ← (W_k(i)/Z_k) × e^{+α_k}  if h_k(x^i) ≠ y_i (incorrectly classified)
8     until k = kmax
9     return Ck and α_k for k = 1 to kmax (ensemble of classifiers with weights)
10 end


Note that in line 5 the error for classifier Ck is determined with respect to the
distribution W_k(i) over D on which it was trained. In line 7, Z_k is simply a normalizing
constant computed to ensure that W_k(i) represents a true distribution, and h_k(x^i) is
the category label (+1 or -1) given to pattern x^i by component classifier Ck.
Naturally, the loop termination of line 8 could instead use the criterion of sufficiently low
training error of the ensemble classifier.

The final classification decision of a test point x is based on a discriminant function
that is merely the weighted sum of the outputs given by the component classifiers:


$$g(\mathbf{x}) = \left[\sum_{k=1}^{k_{max}} \alpha_k h_k(\mathbf{x})\right]. \tag{36}$$


The classification decision for this two-category case is then simply sgn[g(x)]. 
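
A compact sketch of Algorithm 1 and the decision rule of Eq. 36 is given below. Note one substitution, made plainly: rather than drawing a training set at random according to W_k(i) (line 4), this sketch hands the weights directly to the weak learner, a common variant that makes the example deterministic. The weak learner `train_weighted_stump` and the clamping constant that guards the logarithm are our own assumptions.

```python
import math

def train_weighted_stump(data, w):
    """Hypothetical weak learner: the single-feature threshold with the
    smallest weighted error. Returns (classifier, weighted error)."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            for sign in (+1, -1):
                err = sum(wi for (xi, yi), wi in zip(data, w)
                          if (sign if xi[f] > x[f] else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, x[f], sign)
    err, f, thr, sign = best
    return (lambda x: sign if x[f] > thr else -sign), err

def adaboost(data, k_max):
    n = len(data)
    w = [1.0 / n] * n                                  # W_1(i) = 1/n
    ensemble = []
    for _ in range(k_max):
        h, err = train_weighted_stump(data, w)         # lines 4 and 5
        err = min(max(err, 1e-12), 1.0 - 1e-12)        # guard: avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)      # line 6
        ensemble.append((alpha, h))
        # Line 7: shrink weights of correct patterns, grow the others.
        w = [wi * math.exp(-alpha if h(x) == y else alpha)
             for (x, y), wi in zip(data, w)]
        z = sum(w)                                     # normalizer Z_k
        w = [wi / z for wi in w]
    return ensemble

def classify(ensemble, x):
    g = sum(alpha * h(x) for alpha, h in ensemble)     # Eq. 36
    return 1 if g >= 0 else -1
```

On a one-dimensional problem in which the positive category occupies an interval, no single stump can separate the data, yet three boosted stumps can.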

Except in pathological cases, so long as each component classifier is a weak learner,
the total training error of the ensemble can be made arbitrarily low by setting the
number of component classifiers, kmax, sufficiently high. To see this, notice that the
training error for weak learner Ck can be written as E_k = 1/2 − G_k for some positive
value G_k. Thus the ensemble training error is (Problem 31):


$$E = \prod_{k=1}^{k_{max}} \left[ 2\sqrt{E_k(1-E_k)} \right] = \prod_{k=1}^{k_{max}} \sqrt{1-4G_k^2} \le \exp\left(-2\sum_{k=1}^{k_{max}} G_k^2\right), \tag{37}$$


as illustrated in Fig. 9.7. It is sometimes beneficial to increase kmax beyond the value 
needed for zero ensemble training error since this may improve generalization. While 
a large kmax could in principle lead to overfitting, simulation experiments have shown 
that overfitting rarely occurs, even when kmax is extremely large. 
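
To get a feel for Eq. 37, the short sketch below evaluates both the product and its exponential bound for a given list of component errors E_k. The function name and the example errors are hypothetical, chosen only for illustration.

```python
import math

def ensemble_error_bound(errors):
    """Return (product form, exponential bound) of Eq. 37 for a list of
    component training errors E_k, each assumed to be below 0.5."""
    product = 1.0
    for e in errors:
        product *= 2.0 * math.sqrt(e * (1.0 - e))
    gammas = [0.5 - e for e in errors]                 # G_k = 1/2 - E_k
    exp_bound = math.exp(-2.0 * sum(g * g for g in gammas))
    return product, exp_bound
```

For three weak learners each with E_k = 0.3, the product is about 0.770 and the exponential bound about 0.787; both shrink geometrically as further weak learners are added.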

At first glance, it appears that boosting violates the No Free Lunch Theorem in 
that an ensemble classifier seems always to perform better than any single component 
classifier on the full training set. After all, according to Eq. 37 the training error drops 
exponentially fast with the number of component classifiers. The Theorem is not vi- 
olated, however: boosting only improves classification if the component classifiers 
perform better than chance, but this cannot be guaranteed a priori. If the component 
classifiers cannot learn the task better than chance, then we do not have a strong 
match between the problem and model, and should choose an alternate learning al- 
gorithm. Moreover, the exponential reduction in error on the training set does not 
insure reduction of the off-training set error or generalization, as we saw in Sect. 9.2.1. 
Nevertheless, AdaBoost has proven effective in many real-world applications. 


Figure 9.7: AdaBoost applied to a weak learning system can reduce the training error
E exponentially as the number of component classifiers, kmax, is increased. Because
AdaBoost “focuses on” difficult training patterns, the training error of each successive
component classifier (measured on its own weighted training set) is generally larger
than that of any previous component classifier (shown in gray). Nevertheless, so long
as the component classifiers perform better than chance (e.g., have error less than 0.5
on a two-category problem), the weighted ensemble decision of Eq. 36 ensures that
the training error will decrease, as given by Eq. 37. It is often found that the test
error decreases in boosted systems as well, as shown in red.


9.5.3 Learning with queries


In the previous sections we assumed there was a set of labeled training patterns D
and employed resampling methods to reuse patterns to improve classification. In
some applications, however, the patterns are unlabeled. We shall return in Chap. ??
to the problem of learning when no labels are available but here we assume there
exists some (possibly costly) way of labeling any pattern. Our current challenge is
thus to determine which unlabeled patterns would be most informative (i.e., improve
the classifier the most) if they were labeled and used as training patterns. These
are the patterns we will present as a query to an oracle, a teacher who can label
any pattern without error. This approach is called variously learning with queries,
active learning or interactive learning and is a special case of a resampling technique.

Learning with queries might be appropriate, for example, when we want to de- 
sign a classifier for handwritten numerals using unlabeled pixel images scanned from 
documents from a corpus too large for us to label every pattern. We could start by 
randomly selecting some patterns, presenting them to an oracle, and then training the 
classifier with the returned labels. We then use learning with queries to select unlabeled
patterns from our set to present to a human (the oracle) for labeling. Informally,
we would expect the most valuable patterns to be near the decision boundaries.

More generally we begin with a preliminary, weak classifier that has been developed 
with a small set of labeled samples. There are two related methods for then selecting 
an informative pattern, i.e., a pattern for which the current classifier is least certain. 
In confidence based query selection the classifier computes discriminant functions g_i(x)
for the c categories, i = 1,...,c. An informative pattern x is one for which the two
largest discriminant functions have nearly the same value; such patterns lie near the 
current decision boundaries. Several search heuristics can be used to find such points 
efficiently (Problem 30). 
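
Confidence based query selection can be sketched as a search for the unlabeled pattern whose two largest discriminant values are closest. The sketch below simply scans a finite pool; the nearest-mean discriminants used in the example are a hypothetical stand-in for whatever discriminant functions the current classifier provides.

```python
def top_two_gap(scores):
    """Difference between the largest and second-largest discriminant."""
    first, second = sorted(scores, reverse=True)[:2]
    return first - second

def select_query(unlabeled, discriminants):
    """Return the pattern for which the classifier is least certain.
    discriminants(x) must return the list of g_i(x), one per category."""
    return min(unlabeled, key=lambda x: top_two_gap(discriminants(x)))

def make_nearest_mean_discriminants(means):
    """Hypothetical 1-D example: g_i(x) = -(x - mean_i)^2."""
    return lambda x: [-(x - m) ** 2 for m in means]
```

With category means at 0 and 4, the pattern at 1.9 lies closest to the current decision boundary at 2 and is the one selected from the pool [0.1, 1.9, 3.7, 5.0].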

The second method, voting based or committee based query selection, is similar to
the previous method but is applicable to multiclassifier systems, that is, ones comprising
several component classifiers (Sect. 9.7). Each unlabeled pattern is presented to
each of the k component classifiers; the pattern that yields the greatest disagreement
among the k resulting category labels is considered the most informative pattern, and 
is thus presented as a query to the oracle. Voting based query selection can be used 
even if the component classifiers do not provide analog discriminant functions, for 
instance decision trees, rule-based classifiers or simple k-nearest neighbor classifiers. 
In both confidence based and voting based methods, the pattern labeled by the oracle 
is then used for training the classifier in the traditional way. (We shall return in 
Sect. 9.7 to training an ensemble of classifiers.) 
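
Voting based selection needs only the category labels produced by the component classifiers. The text does not fix a particular measure of “disagreement”; the sketch below assumes one simple choice, the fraction of votes that do not go to the plurality label.

```python
from collections import Counter

def disagreement(labels):
    """Assumed measure: fraction of votes outside the plurality label
    (0.0 means the committee is unanimous)."""
    counts = Counter(labels)
    return 1.0 - counts.most_common(1)[0][1] / len(labels)

def select_query_by_vote(unlabeled, committee):
    """Return the pattern on which the component classifiers disagree most.
    committee is a list of functions mapping a pattern to a category label."""
    return max(unlabeled,
               key=lambda x: disagreement([c(x) for c in committee]))
```

Other measures, such as the entropy of the vote distribution, could be substituted without changing the structure of the selection step.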

Clearly such learning with queries does not directly exploit information about the 
prior distribution of the patterns. In particular, in most problems the distributions 
of query patterns will be large near the final decision boundaries (where patterns are 
informative) rather than at the region of highest prior probability (where they are 
typically less informative), as illustrated in Fig. 9.8. One benefit of learning with 
queries is that we need not guess the form of the underlying distribution, but can 
instead use non-parametric techniques, such as nearest-neighbor classification, that 
allow the decision boundary to be found directly. 

If there is not a large set of unlabeled samples available for queries, we can nev- 
ertheless exploit learning with queries if there is a way to generate query patterns. 
Suppose we have only a small set of labeled handwritten characters. Suppose too we 
have image processing algorithms for altering these images to generate new, surrogate 
patterns for queries to an oracle. For instance the pixel images might be rotated, 
scaled, sheared, be subject to random pixel noise, or have their lines thinned. Fur- 
ther, we might be able to generate new patterns “in between” two labeled patterns by 
interpolating or somehow mixing them in a domain-specific way. With such generated 
query patterns the classifier can explore regions of the feature space about which it is 
least confident (Fig. 9.8). 


9.5.4 Arcing, learning with queries, bias and variance 


In Chap. ?? and many other places, we have stressed the need for training a classifier 
on samples drawn from the distribution on which it will be tested. Resampling in 
general, and learning with queries in particular, seem to violate this recommendation. 
Why can a classifier trained on a strongly weighted distribution of data be expected
to do as well as, or even better than, one trained on the i.i.d. sample? Why doesn’t
resampling lead to worse performance, to the extent that the resampled distribution 
differs from the i.i.d. one? 

Indeed, if we were to take a model of the true distribution and train it with 
a highly skewed distribution obtained by learning with queries, the final classifier 
accuracy might be unacceptably low. Consider, however, two interrelated points about 
resampling methods and altered distributions. The first is that resampling methods 
are generally used with techniques that do not attempt to model or fit the full category 
distributions. Thus even if we suspect the prior distributions for two categories are 
Gaussian, we might use a non-parametric method such as nearest neighbor, radial 
basis function, or RCE classifiers when using learning with queries. Thus in learning 
with queries we are not fitting parameters in a model, as described in Chap. ??, but 
instead are seeking decision boundaries more directly. 

The second point is that as the number of component classifiers is increased,
techniques such as general boosting and AdaBoost effectively broaden the class of
implementable functions, as illustrated in Fig. 9.6. While the final classifier might
indeed be characterized as parametric, it is in an expanded space of parameters, one
larger than that of the first component classifier.


Figure 9.8: Active learning can be used to create classifiers that are more accurate
than ones using i.i.d. sampling. The figure at the top shows a two-dimensional problem
with two equal circular Gaussian priors; the Bayes decision boundary is a straight line
and the Bayes error E_B = 0.02275. The bottom figure on the left shows a nearest-neighbor
classifier trained with n = 30 labeled points sampled i.i.d. from the true
distributions. Note that most of these points are far from the decision boundary.
The figure at the right illustrates active learning. The first four points were sampled
near the extremes of the feature space. Subsequent query points were chosen midway
between two points already used by the classifier, one randomly selected from each of
the two categories. In this way, successive queries to the oracle “focused in” on the
true decision boundary. The final generalization error of this classifier (0.02422) is
lower than the one trained using i.i.d. samples (0.05001).

In broad overview, resampling, boosting and related procedures are heuristic meth- 
ods for adjusting the class of implementable decision functions. As such they allow 
the designer to try to “match” the final classifier to the problem by indirectly adjust- 
ing the bias and variance. The power of these methods is that they can be used with 
an arbitrary classification technique such as the Perceptron, which would otherwise 
prove extremely difficult to adjust to the complexity of an arbitrary problem. 


9.6 Estimating and comparing classifiers 


There are at least two reasons for wanting to know the generalization rate of a classifier 
on a given problem. One is to see if the classifier performs well enough to be useful; 
another is to compare its performance with that of a competing design. Estimating 
the final generalization performance invariably requires making assumptions about 
the classifier or the problem or both, and can fail if the assumptions are not valid. 
We should stress, then, that all the following methods are heuristic. Indeed, if there 
were a foolproof method for choosing which of two classifiers would generalize better 
on an arbitrary new problem, we could incorporate such a method into the learning 
and violate the No Free Lunch Theorem. Occasionally our assumptions are explicit 
(as in parametric models), but more often than not they are implicit and difficult to 
identify or relate to the final estimation (as in empirical methods).


9.6.1 Parametric models


One approach to estimating the generalization rate is to compute it from the as- 
sumed parametric model. For example, in the two-class multivariate normal case, we 
might estimate the probability of error using the Bhattacharyya or Chernoff bounds 
(Chap. ??), substituting estimates of the means and the covariance matrix for the 
unknown parameters. However, there are three problems with this approach. First, 
such an error estimate is often overly optimistic; characteristics that make the training 
samples peculiar or unrepresentative will not be revealed. Second, we should always 
suspect the validity of an assumed parametric model; a performance evaluation based 
on the same model cannot be believed unless the evaluation is unfavorable. Finally, 
in more general situations where the distributions are not simple it is very difficult to 
compute the error rate exactly, even if the probabilistic structure is known completely. 


9.6.2 Cross validation 


In cross validation we randomly split the set of labeled training samples D into two 
parts: one is used as the traditional training set for adjusting model parameters in the 
classifier. The other set — the validation set — is used to estimate the generalization 
error. Since our ultimate goal is low generalization error, we train the classifier until 
we reach a minimum of this validation error, as sketched in Fig. 9.9. It is essential that 
the validation (or the test) set not include points used for training the parameters in
the classifier; including them is a methodological error known as “testing on the training set.” *


Figure 9.9: In cross validation, the data set D is split into two parts. The first (e.g.,
90% of the patterns) is used as a standard training set for setting free parameters in the 
classifier model; the other (e.g., 10%) is the validation set and is meant to represent the 
full generalization task. For most problems, the training error decreases monotonically 
during training, as shown in black. Typically, the error on the validation set decreases, 
but then increases, an indication that the classifier may be overfitting the training 
data. In cross validation, training or parameter adjustment is stopped at the first 
minimum of the validation error. 


Cross validation can be applied to virtually every classification method, where the
specific form of learning or parameter adjustment depends upon the general training
method. For example, in neural networks of a fixed topology (Chap. ??), the amount
of training is the number of epochs or presentations of the training set. Alternatively,
the number of hidden units can be set via cross validation. Likewise, the width of the
Gaussian window in Parzen windows (Chap. ??), and an optimal value of k in the
k-nearest neighbor classifier (Chap. ??) can be set by cross validation.


* A related but less obvious problem arises when a classifier undergoes a long series of refinements
guided by the results of repeated testing on the same test data. This form of “training on the test
data” often escapes attention until new test samples are obtained.

Cross validation is heuristic and need not (indeed cannot) give improved classifiers
in every case. Nevertheless, it is extremely simple and for many real-world problems
is found to improve generalization accuracy. There are several heuristics for choosing
the portion γ of D to be used as a validation set (0 < γ < 1). Nearly always, a
smaller portion of the data should be used as validation set (γ < 0.5) because the
validation set is used merely to set a single global property of the classifier (i.e., when
to stop adjusting parameters) rather than the large number of classifier parameters
learned using the training set. If a classifier has a large number of free parameters
or degrees of freedom, then a larger portion of D should be used as a training set,
i.e., γ should be reduced. A traditional default is to split the data with γ = 0.1,
which has proven effective in many applications. Finally, when the number of degrees
of freedom in the classifier is small compared to the number of training points, the
predicted generalization error is relatively insensitive to the choice of γ.

A simple generalization of the above method is m-fold cross validation. Here the 
training set is randomly divided into m disjoint sets of equal size n/m, where n is 
again the total number of patterns in D. The classifier is trained m times, each 
time with a different set held out as a validation set. The estimated performance is 
the mean of these m errors. In the limit where m = n, the method is in effect the 
leave-one-out approach to be discussed in Sect. 9.6.3. 
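
The m-fold procedure can be sketched as below. The trainer is passed in as a function, so any classifier can be used; the one-dimensional nearest-mean trainer is a hypothetical example included only to make the sketch self-contained, and the fold assignment by striding through a shuffled index list is one convenient way to obtain disjoint folds whose sizes differ by at most one.

```python
import random

def train_nearest_mean(data):
    """Hypothetical 1-D classifier: assign x to the category whose
    training mean is nearest. data is a list of (x, y), x a float."""
    sums = {}
    for x, y in data:
        s, c = sums.get(y, (0.0, 0))
        sums[y] = (s + x, c + 1)
    means = {y: s / c for y, (s, c) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def m_fold_cv(data, m, train, rng=None):
    """Mean validation error over m disjoint held-out folds (m <= n)."""
    rng = rng or random.Random(0)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[j::m] for j in range(m)]       # m disjoint folds
    errors = []
    for held in folds:
        held_set = set(held)
        clf = train([data[i] for i in idx if i not in held_set])
        wrong = sum(1 for i in held if clf(data[i][0]) != data[i][1])
        errors.append(wrong / len(held))
    return sum(errors) / m
```

Setting m equal to the number of patterns turns the same function into a leave-one-out estimate.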

We emphasize that cross validation is a heuristic and need not work on every problem.
Indeed, there are problems for which anti-cross validation is effective, that is, halting
the adjustment of parameters when the validation error reaches its first local maximum.
As such, in any particular problem designers must be prepared to explore different
values of γ, and possibly abandon the use of cross validation altogether if performance
cannot be improved (Computer exercise 5).

Cross validation is, at base, an empirical approach that tests the classifier experi- 
mentally. Once we train a classifier using cross validation, the validation error gives 
an estimate of the accuracy of the final classifier on the unknown test set. If the true 
but unknown error rate of the classifier is p, and if k of the n’ independent, randomly 
drawn test samples are misclassified, then k has the binomial distribution 


$$P(k) = \binom{n'}{k} p^k (1-p)^{n'-k}. \tag{38}$$


Thus, the fraction of test samples misclassified is exactly the maximum likelihood 
estimate for p (Problem 39): 


$$\hat{p} = \frac{k}{n'}. \tag{39}$$

The properties of this estimate for the parameter p of a binomial distribution are
well known. In particular, Fig. 9.10 shows 95% confidence intervals as a function of
p̂ and n’. For a given value of p̂, the probability is 0.95 that the true value of p lies
in the interval between the lower and upper curves marked by the number n’ of test
samples (Problem 36). These curves show that unless n’ is fairly large, the maximum
likelihood estimate must be interpreted with caution. For example, if no errors are
made on 50 test samples, with probability 0.95 the true error rate is between zero and
8%. The classifier would have to make no errors on more than 250 test samples to be
reasonably sure that the true error rate is below 2%.


Figure 9.10: The 95% confidence intervals for a given estimated error probability p̂
can be derived from the binomial distribution of Eq. 38. For each value of p̂, the true
probability has a 95% chance of lying between the curves marked by the number of
test samples n’. The larger the number of test samples, the more precise the estimate
of the true probability and hence the smaller the 95% confidence interval.
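
Intervals of this kind can also be computed directly from Eq. 38 rather than read from curves. The sketch below uses the exact (Clopper-Pearson) construction solved by bisection; this is one standard choice and an assumption on our part, since the method used to draw the curves of Fig. 9.10 is not stated here, so the numbers need not match the figure exactly.

```python
import math

def binom_cdf(k, n, p):
    """P(K <= k) for K binomial with parameters n and p (Eq. 38)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k + 1))

def clopper_pearson(k, n, confidence=0.95):
    """Return (p_hat, lower, upper) for k errors on n test samples."""
    alpha = 1.0 - confidence

    def solve(f, target):            # f is decreasing in p on [0, 1]
        lo, hi = 0.0, 1.0
        for _ in range(60):          # bisection
            mid = (lo + hi) / 2.0
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p),
                                     1 - alpha / 2)
    return k / n, lower, upper
```

With no errors on 50 test samples this construction gives an upper limit of roughly 0.07, of the same order as the value quoted above, and with no errors on 250 samples the upper limit falls below 0.02.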


9.6.3 Jackknife and bootstrap estimation of classification accuracy


A method for comparing classifiers closely related to cross validation is to use the 
jackknife or bootstrap estimation procedures (Sects. 9.4.1 & 9.4.2). The application 
of the jackknife approach to classification is straightforward. We estimate the accuracy 
of a given algorithm by training the classifier n separate times, each time using the 
training set D from which a different single training point has been deleted. This is 
merely the m = n limit of m-fold cross validation. Each resulting classifier is tested 
on the single deleted point and the jackknife estimate of the accuracy is then simply 
the mean of these leave-one-out accuracies. Here the computational complexity may 
be very high, especially for large n (Problem 28). 

The jackknife, in particular, generally gives good estimates, since each of the
n classifiers is quite similar to the classifier being tested (differing solely due to a
single training point). Likewise, the jackknife estimate of the variance of this estimate
is given by a simple generalization of Eq. 32. A particular benefit of the jackknife
approach is that it can provide measures of confidence or statistical significance in the
comparison between two classifier designs. Suppose trained classifier C1 has an accuracy
of 80% while C2 has accuracy of 85%, as estimated by the jackknife procedure.
Is C2 really better than C1? To answer this, we calculate the jackknife estimate of
the variance of the classification accuracies and use traditional hypothesis testing to
see if C2’s apparent superiority is statistically significant (Fig. 9.11).
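
A leave-one-out accuracy estimate and a rough significance check might be sketched as follows. Two assumptions should be flagged: the variance of the estimate is taken here as the sample variance of the n leave-one-out outcomes divided by n, standing in for the generalization of Eq. 32 that is not reproduced in this section, and the z statistic treats the two accuracy estimates as independent. The nearest-mean trainer is again a hypothetical example.

```python
import math

def train_nearest_mean(data):
    """Hypothetical 1-D classifier: nearest training mean wins."""
    sums = {}
    for x, y in data:
        s, c = sums.get(y, (0.0, 0))
        sums[y] = (s + x, c + 1)
    means = {y: s / c for y, (s, c) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def leave_one_out(data, train):
    """Return (accuracy estimate, assumed variance of that estimate)."""
    n = len(data)
    hits = []
    for i in range(n):
        clf = train(data[:i] + data[i + 1:])    # delete one training point
        x, y = data[i]
        hits.append(1.0 if clf(x) == y else 0.0)
    acc = sum(hits) / n
    var_of_mean = sum((h - acc) ** 2 for h in hits) / (n * (n - 1))
    return acc, var_of_mean

def z_score(acc1, var1, acc2, var2):
    """Difference in accuracy in units of its (assumed) standard error."""
    return (acc2 - acc1) / math.sqrt(var1 + var2)
```

For accuracies of 80% and 85% with standard deviations of 6% and 7.5%, the z score is about 0.52, well short of the 1.96 needed for significance at the 5% level under a normal approximation.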

There are several ways to generalize the bootstrap method to the problem of
estimating the accuracy of a classifier. One of the simplest approaches is to train B
classifiers, each with a different bootstrap data set, and test on other bootstrap data
sets. The bootstrap estimate of the classifier accuracy is simply the mean of these
bootstrap accuracies. In practice, the high computational complexity of bootstrap
estimation of classifier accuracy is rarely worth possible improvements in that estimate.
In Sect. 9.5.1 we discussed bagging, a useful modification of bootstrap estimation.


Figure 9.11: Jackknife estimation can be used to compare the accuracies of classifiers.
The jackknife estimates of the accuracies of classifiers C1 and C2 are 80% and 85%, and
full widths (twice the square root of the jackknife estimate of the variances) are 12% and 15%,
as shown by the bars at the bottom. In this case, traditional hypothesis testing could
show that the difference is not statistically significant at some confidence level.


9.6.4 Maximum-likelihood model comparison 


Recall first the maximum-likelihood parameter estimation methods discussed in Chap. ??.
Given a model with unknown parameter vector θ, we find the value θ̂ which maximizes
the probability of the training data, i.e., p(D|θ̂). Maximum-likelihood model
comparison or maximum-likelihood model selection (sometimes called ML-II) is
a direct generalization of those techniques. The goal here is to choose the model that
best explains the training data, in a way that will become clear below.


We again let h; € H represent a candidate hypothesis or model (assumed discrete 
for simplicity), and D the training data. The posterior probability of any given model 
is given by Bayes’ rule: 


x P(D|h;)P(h,), (40) 


where we will rarely need the normalizing factor p(D). The data-dependent term,
$P(D|h_i)$, is the evidence for $h_i$; the second term, $P(h_i)$, is our subjective prior over
the space of hypotheses: it rates our confidence in different models even before the
data arrive. In practice, the data-dependent term dominates in Eq. 40, and hence the
priors $P(h_i)$ are often neglected in the computation. In maximum-likelihood model
comparison, we find the maximum likelihood parameters for each of the candidate
models, calculate the resulting likelihoods, and select the model with the largest such
likelihood in Eq. 40 (Fig. 9.12).
Figure 9.12: The evidence (i.e., probability of generating different data sets given a
model) is shown for three models of different expressive power or complexity. Model
$h_1$ is the most expressive, since with different values of its parameters the model can
fit a wide range of data sets. Model $h_3$ is the most restrictive of the three. If the
actual data observed is $D^0$, then maximum-likelihood model selection states that we
should choose $h_2$, which has the highest evidence. Model $h_2$ “matches” this particular
data set better than do the other two models, and should be selected.


9.6.5 Bayesian model comparison 


Bayesian model comparison uses the full information over priors when computing
posterior probabilities in Eq. 40. In particular, the evidence for a particular hypothesis
is an integral of the likelihood over the prior on the parameters,


$$P(D|h_i) = \int p(D|\theta, h_i)\, p(\theta|h_i)\, d\theta, \qquad (41)$$


where as before $\theta$ describes the parameters in the candidate model. It is common for
the posterior $P(\theta|D, h_i)$ to be peaked at $\hat{\theta}$, and thus the evidence integral can often
be approximated as:


$$P(D|h_i) \simeq \underbrace{P(D|\hat{\theta}, h_i)}_{\text{best fit likelihood}} \; \underbrace{p(\hat{\theta}|h_i)\,\Delta\theta}_{\text{Occam factor}}. \qquad (42)$$


Before the data arrive, model $h_i$ has some broad range of model parameters,
denoted by $\Delta^0\theta$ and shown in Fig. 9.13. After the data arrive, a smaller range is
commensurate or compatible with D, denoted $\Delta\theta$. The Occam factor in Eq. 42,


$$\text{Occam factor} = p(\hat{\theta}|h_i)\,\Delta\theta = \frac{\Delta\theta}{\Delta^0\theta} = \frac{\text{param. vol. commensurate with } D}{\text{param. vol. commensurate with any data}}, \qquad (43)$$



is the ratio of two volumes in parameter space: 1) the volume that can account for data 
D and 2) the prior volume, accessible to the model without regard to D. The Occam 
factor has magnitude less than 1.0; it is simply the factor by which the hypothesis 
space collapses by the presence of data. The more the training data, the smaller the 
range of parameters that are commensurate with it, and thus the greater this collapse 
in the parameter space and the smaller the Occam factor (Fig. 9.13).
Figure 9.13: In the absence of training data, a particular model h has available a
large range of possible values of its parameters, denoted $\Delta^0\theta$. In the presence of a
particular training set D, a smaller range is available. The Occam factor, $\Delta\theta/\Delta^0\theta$,
measures the fractional decrease in the volume of the model’s parameter space due
to the presence of training data D. In practice, the Occam factor can be calculated
fairly easily if the evidence is approximated as a k-dimensional Gaussian, centered on
the maximum-likelihood value $\hat{\theta}$.


Naturally, once the posteriors for different models have been calculated by Eq. 42 & 
40, we select the single one having the highest such posterior. (Ironically, the Bayesian 
model selection procedure is itself not truly Bayesian, since a Bayesian procedure 
would average over all possible models when making a decision.) 

The evidence for $h_i$, i.e., $P(D|h_i)$, was ignored in a maximum-likelihood setting
of parameters $\hat{\theta}$; nevertheless it is the central term in our comparison of models. As
mentioned, in practice the evidence term in Eq. 40 dominates the prior term, and it
is traditional to ignore such priors, which are often highly subjective or problematic
anyway (Problem 38, Computer exercise 7). This procedure represents an inherent
bias towards simple models (small $\Delta\theta$); models that are overly complex (large $\Delta\theta$) are
automatically self-penalizing, where “overly complex” is a data-dependent concept.

In the general case, the full integral of Eq. 41 is too difficult to calculate
analytically or even numerically. Nevertheless, if $\theta$ is k-dimensional and the posterior
can be assumed to be a Gaussian, then the Occam factor can be calculated directly
(Problem 37), yielding:


$$P(D|h_i) \simeq \underbrace{P(D|\hat{\theta}, h_i)}_{\text{best fit likelihood}} \; \underbrace{p(\hat{\theta}|h_i)\,(2\pi)^{k/2}\,|\mathbf{H}|^{-1/2}}_{\text{Occam factor}}, \qquad (44)$$


where


$$\mathbf{H} = \frac{\partial^2 \ln p(\theta|D, h_i)}{\partial \theta^2} \qquad (45)$$


is a Hessian matrix (a matrix of second-order derivatives) and measures how
“peaked” the posterior is around the value $\hat{\theta}$. Note that this Gaussian approximation
does not rely on the fact that the underlying model of the distribution of the data
in feature space is or is not Gaussian. Rather, it is based on the assumption that
the evidence distribution arises from a large number of independent uncorrelated
processes and is governed by the Law of Large Numbers. The integration inherent
in Bayesian methods is simplified using this Gaussian approximation to the evidence.
Since calculating the needed Hessian via differentiation is nearly always simpler than
a high-dimensional numerical integration, the Bayesian method of model selection
is not at a severe computational disadvantage relative to its maximum likelihood
counterpart.
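
To make the role of the Occam factor concrete, the following sketch compares two hypothetical models of n coin tosses containing k heads: a fair coin with no free parameter, and a coin with unknown bias and a uniform prior on [0, 1]. The coin example, the uniform prior and the function names are assumptions chosen for illustration. The code applies Eq. 44 with a one-dimensional parameter and checks it against the exact evidence, which for this model is a Beta function.

```python
import math

def log_evidence_fair(n, k):
    """Fair-coin model: no free parameters, so P(D|h) = 0.5**n."""
    return n * math.log(0.5)

def log_max_likelihood(n, k):
    """Best-fit log-likelihood of the biased-coin model (requires 0 < k < n)."""
    t = k / n
    return k * math.log(t) + (n - k) * math.log(1.0 - t)

def log_evidence_laplace(n, k):
    """Eq. 44 in one dimension with a uniform prior p(theta|h) = 1 on [0, 1]."""
    t = k / n
    hessian = n / (t * (1.0 - t))  # magnitude of d^2 ln p / d theta^2 at theta-hat
    log_occam = 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(hessian)
    return log_max_likelihood(n, k) + log_occam

def log_evidence_exact(n, k):
    """Exact integral of theta^k (1 - theta)^(n - k) over [0, 1]."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
```

For n = 100, the best-fit likelihood of the biased-coin model is never below that of the fair coin, so maximum-likelihood comparison cannot prefer the simpler model. The evidence can: with k = 50 the Occam factor makes the fair coin the better-supported model, whereas with k = 70 the biased coin wins despite the penalty.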

There may be a problem due to degeneracies in a model — several parameters could 
be relabeled and leave the classification rule (and hence the likelihood) unchanged. 
The resulting degeneracy leads, in essence, to an “overcounting” which alters the 
effective volume in parameter space. Degeneracies are especially common in neu- 
ral network models where the parameterization comprises many equivalent weights 
(Chap. ??). For such cases, we must multiply the right hand side of Eq. 42 by the 
degeneracy of $\hat{\theta}$ in order to scale the Occam factor, and thereby obtain the proper
estimate of the evidence (Problem 42). 


Bayesian model selection and the No Free Lunch Theorem 


There seems to be a fundamental contradiction between two of the deepest ideas 
in the foundation of statistical pattern recognition. On the one hand, the No Free 
Lunch Theorem states that in the absence of prior information about the problem, 
there is no reason to prefer one classification algorithm over another. On the other 
hand, Bayesian model selection is theoretically well founded and seems to show how 
to reliably choose the better of two algorithms. 

Consider two “composite” algorithms — algorithm A and algorithm B — each 
of which employs two others (algorithm 1 and algorithm 2). For any problem, algo- 
rithm A uses Bayesian model selection and applies the “better” of algorithm 1 and 
algorithm 2. Algorithm B uses anti-Bayesian model selection and applies the “worse” 
of algorithm 1 and algorithm 2. It appears that algorithm A will reliably outperform 
algorithm B throughout the full class of problems — in contradiction with Part 1 of 
the No Free Lunch Theorem. 

What is the resolution of this apparent contradiction? In Bayesian model selection 
we ignore the prior over the space of models, H, effectively assuming it is uniform. 
This assumption therefore does not take into account how those models correspond to 
underlying target functions, i.e., mappings from input to category labels. Accordingly, 
Bayesian model selection usually corresponds to a non-uniform prior over target func- 
tions. Moreover, depending on the arbitrary choice of model, the precise non-uniform 
prior will vary. In fact, this arbitrariness is very well-known in statistics, and good 
practitioners rarely apply the principle of indifference, assuming a uniform prior over 
models, as Bayesian model selection requires. Indeed, there are many “paradoxes” 
described in the statistics literature that arise from not being careful to have the prior 
over models be tailored to the choice of models (Problem 38). The No Free Lunch 
Theorem allows that for some particular non-uniform prior there may be a learning 
algorithm that gives better than chance — or even optimal — results. Apparently 
Bayesian model selection corresponds to non-uniform priors that seem to match many 
important real-world problems. 


9.6.6 The problem-average error rate 


The examples we have given thus far suggest that the problem with having only a 
small number of samples is that the resulting classifier will not perform well on new 
data, that is, it will not generalize well. Thus, we expect the error rate to be a
function of the number n of training samples, typically decreasing to some minimum value as n
approaches infinity. To investigate this analytically, we must carry out the following 
familiar steps: 


1. Estimate the unknown parameters from samples. 
2. Use these estimates to determine the classifier. 


3. Calculate the error rate for the resulting classifier. 


In general this analysis is very complicated. The answer depends on everything — on 
the particular training patterns, on the way they are used to determine the classifier, 
and on the unknown, underlying probability structure. However, by using histogram 
approximations to the unknown probability densities and averaging appropriately, it 
is possible to draw some illuminating conclusions. 

Consider a case in which two categories have equal prior probabilities. Suppose
that we partition the feature space into some number m of disjoint cells $C_1, \ldots, C_m$. If
the conditional densities $p(\mathbf{x}|\omega_1)$ and $p(\mathbf{x}|\omega_2)$ do not vary appreciably within any cell,
then instead of needing to know the actual value of $\mathbf{x}$, we need only know into which
cell $\mathbf{x}$ falls. This reduces the problem to the discrete case. Let $p_i = P(\mathbf{x} \in C_i|\omega_1)$
and $q_i = P(\mathbf{x} \in C_i|\omega_2)$. Then, since we have assumed that $P(\omega_1) = P(\omega_2) = 1/2$, the
vectors $\mathbf{p} = (p_1, \ldots, p_m)^t$ and $\mathbf{q} = (q_1, \ldots, q_m)^t$ determine the probability structure of
the problem. If $\mathbf{x}$ falls in $C_i$, the Bayes decision rule is to decide $\omega_1$ if $p_i > q_i$. The
resulting Bayes error rate is given by


$$P(E|\mathbf{p}, \mathbf{q}) = \frac{1}{2} \sum_{i=1}^{m} \min[p_i, q_i]. \qquad (46)$$


When the parameters $\mathbf{p}$ and $\mathbf{q}$ are unknown and must be estimated from a set
of training patterns, the resulting error rate will be larger than the Bayes rate. The
exact error probability will depend on the set of training patterns and the way in
which they are used to obtain the classifier. Suppose that half of the samples are
labeled $\omega_1$ and half are labeled $\omega_2$, with $n_{ij}$ being the number that fall in $C_i$ and
are labeled $\omega_j$. Suppose further that we design the classifier by using the maximum
likelihood estimates $\hat{p}_i = 2n_{i1}/n$ and $\hat{q}_i = 2n_{i2}/n$ as if they were the true values.
Then a new feature vector falling in $C_i$ will be assigned to $\omega_1$ if $n_{i1} > n_{i2}$. With all of
these assumptions, it follows that the probability of error for the resulting classifier
is given by


$$P(E|\mathbf{p}, \mathbf{q}, D) = \frac{1}{2} \sum_{n_{i1} > n_{i2}} q_i + \frac{1}{2} \sum_{n_{i1} \le n_{i2}} p_i. \qquad (47)$$



To evaluate this probability of error, we need to know the true conditional
probabilities $\mathbf{p}$ and $\mathbf{q}$, and the set of training patterns, or at least the numbers $n_{ij}$.
Different sets of n randomly chosen patterns will yield different values for $P(E|\mathbf{p}, \mathbf{q}, D)$. We
can use the fact that the numbers $n_{ij}$ have a multinomial distribution to average over
all of the possible sets of n random samples and obtain an average probability of error
$P(E|\mathbf{p}, \mathbf{q}, n)$. Roughly speaking, this is the typical error rate one should expect for
n samples. However, evaluation of this average error rate still requires knowing the 
underlying problem, i.e., the values for p and q. If p and q are quite different, the 
average error rate will be near zero, while if p and q are quite similar, it will be near 
0.5. 
A sweeping way to eliminate this dependence of the answer upon the problem
is to average the answer over all possible problems. That is, we assume some prior
distribution for the unknown parameters $\mathbf{p}$ and $\mathbf{q}$, and average $P(E|\mathbf{p}, \mathbf{q}, n)$ with
respect to $\mathbf{p}$ and $\mathbf{q}$. The resulting problem-average probability of error $P(E|m, n)$
will depend only on the number m of cells, the number n of samples, and the prior
distributions.

Of course, choosing the prior distributions is a delicate matter. By favoring easy
problems, we can make P approach zero, and by favoring hard problems we can make
P approach 0.5. We would like to choose a prior distribution corresponding to the
class of problems we typically encounter, but there is no obvious way to do that. A
bold approach is merely to assume that problems are “uniformly distributed,” i.e.,
that the vectors $\mathbf{p}$ and $\mathbf{q}$ are distributed uniformly over the simplexes


$$p_i \ge 0, \quad \sum_{i=1}^{m} p_i = 1, \qquad q_i \ge 0, \quad \sum_{i=1}^{m} q_i = 1. \qquad (48)$$

Note that this uniform distribution over the space of p and q does not correspond to 
some purported uniform distribution over possible distributions or target functions, 
the issue pointed out in Sect. 9.6.5. 

Figure 9.14 summarizes simulation experiments and shows curves of P as a function
of the number of cells for fixed numbers of training patterns. With an infinite number of
training patterns the maximum likelihood estimates are perfect, and P is the average
of the Bayes error rate over all problems. The corresponding curve for $P(E|m, \infty)$
decreases rapidly from 0.5 at m = 1 to the asymptotic value of 0.25 as m approaches
infinity. The fact that P = 0.5 if m = 1 is not surprising, since if there is only one 
cell the decision must be based solely on the prior probabilities. The fact that P 
approaches 0.25 as m approaches infinity is aesthetically pleasing, since this value is 
halfway between the extremes of 0.0 and 0.5. The fact that the problem-average error 
rate is so high merely shows that many hopelessly difficult classification problems 
are included in this average. Clearly, it would be rash indeed to conclude that the 
“average” pattern recognition problem will have this error rate. 
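
The infinite-sample curve can be checked numerically with a short Monte Carlo sketch. It assumes, as in the text, that p and q are drawn uniformly from their simplexes; the sampling trick used here (normalizing independent exponential draws) and the function names are choices made for this illustration.

```python
import random

def uniform_simplex(m, rng):
    """Draw a probability vector uniformly from the simplex with m components."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]

def bayes_error(p, q):
    """Eq. 46: Bayes error for equal priors and cell probabilities p and q."""
    return 0.5 * sum(min(pi, qi) for pi, qi in zip(p, q))

def average_bayes_error(m, problems=2000, seed=0):
    """Monte Carlo average of Eq. 46 over uniformly distributed problems."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(problems):
        total += bayes_error(uniform_simplex(m, rng), uniform_simplex(m, rng))
    return total / problems
```

Calling `average_bayes_error(1)` gives 0.5, since with a single cell both vectors equal (1), and the average falls toward 0.25 as m grows, in line with the behavior described above.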

However, the most interesting feature of these curves is that for every curve in- 
volving a finite number of samples there is an optimal number of cells. This is directly 
related to the fact that with a finite number of samples the performance will worsen 
if too many features are used. In this case it is clear why there exists an optimal 
number of cells for any given n. At first, increasing the number of cells makes
it easier to distinguish between the distributions represented by the vectors p and q, 
thereby allowing improved performance. However, if the number of cells becomes too 
large, there will not be enough training patterns to fill them. Eventually, the number 
of patterns in most cells will be zero, and we must return to using just the ineffective a 
priori probabilities for classification. Thus, for any finite n, P(E|m, n) must approach 
0.5 as m approaches infinity. 
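
The finite-sample behavior can be illustrated in the same spirit. The sketch below draws a random problem, draws n/2 training samples per category, applies the counting rule behind Eq. 47, and averages the resulting error over many problems. The number of simulated problems, the seed and the function names are arbitrary assumptions, and the result is a noisy estimate rather than a reproduction of Fig. 9.14.

```python
import random

def uniform_simplex(m, rng):
    """Draw a probability vector uniformly from the simplex with m components."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]

def finite_sample_error(p, q, n, rng):
    """Eq. 47 for one random training set with n/2 samples per category."""
    m = len(p)
    n1 = [0] * m
    n2 = [0] * m
    for i in rng.choices(range(m), weights=p, k=n // 2):
        n1[i] += 1
    for i in rng.choices(range(m), weights=q, k=n // 2):
        n2[i] += 1
    # A cell is assigned to category 1 only if n_i1 > n_i2; ties go to category 2.
    return 0.5 * sum(q[i] if n1[i] > n2[i] else p[i] for i in range(m))

def problem_average_error(m, n, problems=300, seed=0):
    """Average of Eq. 47 over random problems and random training sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(problems):
        p = uniform_simplex(m, rng)
        q = uniform_simplex(m, rng)
        total += finite_sample_error(p, q, n, rng)
    return total / problems
```

With n = 20, for example, the estimate is 0.5 at m = 1, drops for a moderate number of cells, and climbs back toward 0.5 when m is in the thousands and nearly every cell is empty.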

The value of m for which P(E|m,n) is minimum is quite small. For n = 500 
samples, it is somewhere around m = 200 cells. Suppose that we were to form the 
cells by dividing each feature axis into l intervals. Then with d features we would
have $m = l^d$ cells. If l = 2, which is extremely crude quantization, this implies that
using more than four or five binary features will lead to worse rather than better
performance. This is a very pessimistic result, but then so is the statement that the
average error rate is 0.25. These numerical values are a consequence of the prior
distribution chosen for the problems, and are of no significance when one is facing
a particular problem. The main thing to be learned from this analysis is that the
performance of a classifier certainly does depend on the number of training patterns,
and that if this number is fixed, increasing the number of features beyond a certain
point raises the variance unacceptably, and will be counterproductive.


Figure 9.14: The probability of error E on a two-category problem for a given number
of samples, n, can be estimated by splitting the feature space into m cells of equal size
and classifying a test point according to the label of the most frequently represented
category in the cell. The graphs show the average error of a large number of random
problems having the given n and m indicated.


9.6.7 Predicting final performance from learning curves 


Training on very large data sets can be computationally intensive, requiring days, 
weeks or even months on powerful machines. If we are exploring and comparing several 
different classification techniques, the total training time needed may be unacceptably 
long. What we seek, then, is a method to compare classifiers without the need of 
training all of them fully on the complete data set. If we can determine the most 
promising model quickly and efficiently, we need then only train this model fully. 

One method is to use a classifier’s performance on a relatively small training set 
to predict its performance on the ultimate large training set. Such performance is 
revealed in a type of learning curve in which the test error is plotted versus the size 
of the training set. Figure 9.15 shows the error rate on an independent test set after 
the classifier has been fully trained on n’ < n points in the training set. (Note that 
in this form of learning curve the training error decreases monotonically and does not 
show “overtraining” evident in curves such as Fig. 9.9.) 

Figure 9.15: The test error for three classifiers, each fully trained on the given number
n’ of training patterns, decreases in a typical monotonic power-law function. Notice
that the rank order of the classifiers trained on n’ = 500 points differs from that for
n’ = 10000 points and the asymptotic case.


For many real-world problems, such learning curves decay monotonically and can
be adequately described by a power-law function of the form


$$E_{test} = a + \frac{b}{n'^{\alpha}}, \qquad (49)$$


where a, b and $\alpha > 1$ depend upon the task and the classifier. In the limit of very
large n’, the training error equals the test error, since both the training and test sets
represent the full problem space. Thus we also model the training error as a power-law
function, having the same asymptotic error,


$$E_{train} = a - \frac{c}{n'^{\beta}}. \qquad (50)$$


If the classifier is sufficiently powerful, this asymptotic error, a, is equal to the Bayes 
error. Furthermore, such a powerful classifier can learn perfectly the small training 
sets and thus the training error (measured on the n’ points) will vanish at small n’, 
as shown in Fig. 9.16. 
Figure 9.16: Test and training error of a classifier fully trained on data subsets of
different size n’ selected randomly from the full set D. At low n’, the classifier can learn
the category labels of the points perfectly, and thus the training error vanishes there.
In the limit $n' \to \infty$, both training and test errors approach the same asymptotic
value, a. If the classifier is sufficiently powerful and the training data is sampled i.i.d.,
then a is the Bayes error rate, $E_B$.


Now we seek to estimate the asymptotic error, a, from the training and test errors
on small and intermediate size training sets. From Eqs. 49 & 50 we find:


$$E_{test} + E_{train} = 2a + \frac{b}{n'^{\alpha}} - \frac{c}{n'^{\beta}}$$
$$E_{test} - E_{train} = \frac{b}{n'^{\alpha}} + \frac{c}{n'^{\beta}}. \qquad (51)$$


If we make the assumption of $\alpha = \beta$ and b = c, then Eq. 51 reduces to


$$E_{test} + E_{train} = 2a$$
$$E_{test} - E_{train} = \frac{2b}{n'^{\alpha}}. \qquad (52)$$



Given this assumption, it is a simple matter to measure the training and test errors
for small and intermediate values of n’, plot them on a log-log scale and estimate a,
as shown in Fig. 9.17. Even if the approximations $\alpha = \beta$ and b = c do not hold in
practice, the difference $E_{test} - E_{train}$ nevertheless still forms a straight line on a
log-log plot and the sum, s = b + c, can be found from the height of the $\log[E_{test} + E_{train}]$
curve. The weighted sum $cE_{test} + bE_{train}$ will be a straight line for some empirically
set values of b and c, constrained to obey b + c = s, enabling a to be estimated
(Problem 41). Once a has been estimated for each in the set of candidate classifiers,
the one with the lowest a is chosen and must be trained on the full training set D.
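
Under the simplifying assumption that the two exponents are equal and b = c, the estimation procedure can be sketched as follows. The least-squares fit on the log-log scale, the function name and the synthetic error values (generated from hypothetical constants a = 0.10, b = c = 2.0 and exponent 1.2) are all assumptions of this example.

```python
import math

def fit_learning_curves(sizes, test_err, train_err):
    """Estimate a, b and the exponent, assuming equal exponents and b = c (Eq. 52)."""
    # Sum curve: E_test + E_train = 2a, so average the half-sums.
    a = sum((te + tr) / 2.0 for te, tr in zip(test_err, train_err)) / len(sizes)
    # Difference curve: log(E_test - E_train) = log(2b) - alpha * log(n').
    xs = [math.log(n) for n in sizes]
    ys = [math.log(te - tr) for te, tr in zip(test_err, train_err)]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) /
             sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return a, math.exp(intercept) / 2.0, -slope

# Synthetic, noise-free curves built from the hypothetical constants above.
sizes = [100, 200, 400, 800, 1600]
test_err = [0.10 + 2.0 / n ** 1.2 for n in sizes]
train_err = [0.10 - 2.0 / n ** 1.2 for n in sizes]
a_hat, b_hat, alpha_hat = fit_learning_curves(sizes, test_err, train_err)
```

On real measurements the errors are noisy and the assumptions hold only approximately, so the fitted asymptote should be read as a rough ranking aid for candidate classifiers.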
Figure 9.17: If the test and training errors versus training set size obey the power-law
functions of Eqs. 49 & 50, then the log of the sum and log of the difference of these
errors are straight lines on a log-log plot. The estimate of the asymptotic error rate
a is then simply related to the height of the $\log[E_{test} + E_{train}]$ line, as shown.


9.6.8 The capacity of a separating plane 


Consider the partitioning of a d-dimensional feature space by a hyperplane
$\mathbf{w}^t\mathbf{x} + w_0 = 0$, as might be trained by the Perceptron algorithm (Chap. ??). Suppose that we are
given n sample points in general position, that is, with no subset of d + 1 points falling
in a (d − 1)-dimensional subspace. Assume each point is labeled either $\omega_1$ or $\omega_2$. Of
the $2^n$ possible dichotomies of n points in d dimensions, a certain fraction f(n, d)
are said to be linear dichotomies. These are the labellings for which there exists a
hyperplane separating the points labeled $\omega_1$ from the points labeled $\omega_2$. It can be
shown (Problem 40) that this fraction is given by


$$f(n, d) = \begin{cases} 1 & n \le d + 1 \\ \dfrac{2}{2^n} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d + 1, \end{cases} \qquad (53)$$



as plotted in Fig. 9.18 for several values of d.
To understand the issue more fully, consider the one-dimensional case with four
points; according to Eq. 53, we have f(n = 4, d = 1) = 0.5. The table shows
schematically all sixteen of the equally likely labels for four patterns along a line.
(For instance, 0010 indicates that the labels are assigned $\omega_1\omega_1\omega_2\omega_1$.) The x marks
those arrangements that are linearly separable, i.e., in which a single point decision
boundary can separate all $\omega_1$ patterns from all $\omega_2$ patterns. Indeed as given by Eq. 53,
8 of the 16 (half) are linearly separable.


labels   lin. sep.?      labels   lin. sep.?
0000         x           1000         x
0001         x           1001
0010                     1010
0011         x           1011
0100                     1100         x
0101                     1101
0110                     1110         x
0111         x           1111         x
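

Equation 53 and the table can be checked with a few lines of code. The brute-force test below relies on the observation that, for points on a line, a labeling is separable by a single threshold exactly when the label changes at most once along the line; the function names are hypothetical.

```python
from itertools import product
from math import comb

def linear_fraction(n, d):
    """Eq. 53: fraction of the 2**n dichotomies of n points that are linear."""
    if n <= d + 1:
        return 1.0
    return 2.0 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

def separable_on_a_line(labels):
    """In one dimension a labeling is separable iff it changes value at most once."""
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1

# Brute-force count for four points on a line, matching the table above.
count = sum(separable_on_a_line(lab) for lab in product((0, 1), repeat=4))
```

Evaluating `linear_fraction(2 * (d + 1), d)` for any d returns 0.5, the value that the text below associates with the capacity of a hyperplane.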


Note from Fig. 9.18 that all dichotomies of d+ 1 or fewer points are linear. This 
means that a hyperplane is not overconstrained by the requirement of correctly clas- 
sifying d+ 1 or fewer points. In fact, if d is large it is not until n is a sizable fraction 
of 2(d + 1) that the problem begins to become difficult. At n = 2(d + 1), which is 
sometimes called the capacity of a hyperplane, half of the possible dichotomies are still 
linear. Thus, a linear discriminant is not effectively overdetermined until the number 
of samples is several times as large as the dimensionality of the feature space or subset 
of the problems. This is often expressed as: “generalization begins only after learning 
ends.” Alternatively, we cannot expect a linear classifier to “match” a problem, on 
average, if the dimension of the feature space is greater than n/2— 1. 
Figure 9.18: The fraction of dichotomies of n points in d dimensions that are linear,
as given by Eq. 53. 


9.7 Combining classifiers 


We have already mentioned classifiers whose decision is based on the outputs of
component classifiers (Sects. 9.5.1 & 9.5.2). Such full classifiers are variously called
mixture-of-experts models, ensemble classifiers, modular classifiers or occasionally
pooled classifiers. Such classifiers are particularly useful if each component classifier
is highly trained (i.e., an “expert”) in a different region of the feature space. We
first consider the case where each component classifier provides probability estimates.
Later, in Sect. 9.7.2 we consider the case where component classifiers provide rank
order information or one-of-c outputs.


9.7.1 Component classifiers with discriminant functions 


We assume that each pattern is produced by a mixture model, in which first some
fundamental process or function indexed by r (where $1 \le r \le k$) is randomly chosen
according to distribution $P(r|\mathbf{x}, \theta_0^0)$ where $\theta_0^0$ is a parameter vector. Next, the selected
process r emits an output y (e.g., a category label) according to $P(y|\mathbf{x}, \theta_r^0)$, where the
parameter vector $\theta_r^0$ describes the state of the process. (The superscript 0 indicates
the properties of the generating model. Below, terms without this superscript refer
to the parameters in a classifier.) The overall probability of producing output y is
then the sum over all the processes according to:


$$P(y|\mathbf{x}, \Theta^0) = \sum_{r=1}^{k} P(r|\mathbf{x}, \theta_0^0)\, P(y|\mathbf{x}, \theta_r^0), \qquad (54)$$


where $\Theta^0 = [\theta_0^0, \theta_1^0, \ldots, \theta_k^0]^t$ represents the vector of all relevant parameters.
Equation 54 describes a mixture density, which could be discrete or continuous (Chap. ??).

Figure 9.19 shows the basic architecture of an ensemble classifier whose task is
to classify a test pattern $\mathbf{x}$ into one of c categories; this architecture matches the
assumed mixture model. A test pattern $\mathbf{x}$ is presented to each of the k component
classifiers, each of which emits c scalar discriminant values, one for each category. The
c discriminant values from component classifier r are grouped and marked $\mathbf{g}(\mathbf{x}, \theta_r)$ in
the figure, with


$$\sum_{j=1}^{c} g_{rj} = 1 \quad \text{for all } r. \qquad (55)$$


All discriminant values from component classifier r are multiplied by a scalar weight
$w_r$, governed by the gating subsystem, which has a parameter vector $\theta_0$. Below we
shall use the conditional mean of the mixture density, which can be calculated from
Eq. 54


$$\mu = E[y|\mathbf{x}, \Theta] = \sum_{r=1}^{k} w_r \mu_r, \qquad (56)$$


where $\mu_r$ is the conditional mean associated with $P(y|\mathbf{x}, \theta_r^0)$.

The mixture-of-experts architecture is trained so that each component classifier
models a corresponding process in the mixture model, and the gating subsystem
models the mixing parameters $P(r|\mathbf{x}, \theta_0^0)$ in Eq. 54. The goal is to find parameters
that maximize the log-likelihood for n training patterns $\mathbf{x}^1, \ldots, \mathbf{x}^n$ in set D:


$$l(D, \Theta) = \sum_{i=1}^{n} \ln \left( \sum_{r=1}^{k} P(r|\mathbf{x}^i, \theta_0)\, P(y^i|\mathbf{x}^i, \theta_r) \right). \qquad (57)$$


A straightforward approach is to use gradient descent on the parameters, where the 
derivatives are (Problem 43) 


$$\frac{\partial l(D, \Theta)}{\partial \mu_r} = \sum_{i=1}^{n} P(r|y^i, \mathbf{x}^i)\, \frac{\partial}{\partial \mu_r} \ln[P(y^i|\mathbf{x}^i, \theta_r)] \quad \text{for } r = 1, \ldots, k \qquad (58)$$

and

$$\frac{\partial l(D, \Theta)}{\partial g_r} = \sum_{i=1}^{n} \left( P(r|y^i, \mathbf{x}^i) - w_r^i \right). \qquad (59)$$


Here $P(r|y^i, \mathbf{x}^i)$ is the posterior probability of process r conditional on the
input and output being $\mathbf{x}^i$ and $y^i$, respectively. Moreover, $w_r^i$ is the prior
probability $P(r|\mathbf{x}^i)$ that process r is chosen given the input is $\mathbf{x}^i$. Gradient descent
according to Eq. 59 moves the prior probabilities to the posterior probabilities. The
Expectation-Maximization (EM) algorithm can be used to train this architecture as
well (Chap. ??).


Figure 9.19: The mixture of experts architecture consists of k component classifiers
or “experts,” each of which has trainable parameters $\theta_i$, i = 1, ..., k. For each input
pattern $\mathbf{x}$, each component classifier i gives estimates of the category membership
$g_{ir} = P(\omega_r|\mathbf{x}, \theta_i)$. The outputs are weighted by the gating subsystem governed by
parameter vector $\theta_0$, and pooled for ultimate classification.
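
The statement that the gating update moves the prior weights toward the posterior probabilities can be illustrated for a single training pattern. The sketch assumes that the gating weights are a softmax of real-valued scores, one per component classifier; that parameterization, the learning rate and the likelihood values are assumptions made for this example, since the text specifies only that the gating subsystem has its own parameter vector.

```python
import math

def softmax(scores):
    """Convert real-valued gating scores into weights w_r that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def responsibilities(weights, likelihoods):
    """Posterior P(r|y,x): prior weight w_r times P(y|x,theta_r), normalized."""
    joint = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def gating_step(scores, likelihoods, rate=0.5):
    """One gradient-ascent step on the gating scores for a single pattern."""
    w = softmax(scores)
    h = responsibilities(w, likelihoods)
    # Each score moves in proportion to (posterior - prior).
    return [s + rate * (hr - wr) for s, hr, wr in zip(scores, h, w)]

# Two hypothetical experts; the first explains the observed output far better.
likelihoods = [0.9, 0.1]
scores = [0.0, 0.0]
for _ in range(50):
    scores = gating_step(scores, likelihoods)
final_weights = softmax(scores)
```

Starting from equal weights, repeated steps shift the gating weight toward the expert that assigns the observed output the higher likelihood.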

The final decision rule is simply to choose the category corresponding to the max- 
imum discriminant value after the pooling system. An alternative, winner-take-all 
method is to use the decision of the single component classifier that is “most
confident,” i.e., has the largest single discriminant value $g_{rj}$. While the winner-take-all
method is provably sub-optimal, it nevertheless is simple and can work well if the 
component classifiers are experts in separate regions of the input space. 
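
The two decision rules can be compared on a small example. The discriminant values and gating weights below are invented for illustration, and the function names are hypothetical.

```python
def pooled_decision(outputs, weights):
    """Weighted sum of component outputs; returns (category index, pooled values)."""
    c = len(outputs[0])
    pooled = [sum(w * g[j] for w, g in zip(weights, outputs)) for j in range(c)]
    return max(range(c), key=pooled.__getitem__), pooled

def winner_take_all(outputs):
    """Category chosen by the component with the largest single discriminant value."""
    best = max(outputs, key=max)
    return max(range(len(best)), key=best.__getitem__)

# Three hypothetical component classifiers, three categories; each row sums to 1.
outputs = [[0.6, 0.3, 0.1],
           [0.2, 0.7, 0.1],
           [0.1, 0.1, 0.8]]
weights = [0.5, 0.4, 0.1]  # hypothetical gating weights for this input

category, pooled = pooled_decision(outputs, weights)
wta_category = winner_take_all(outputs)
```

Here the pooled rule selects the second category, whereas winner-take-all follows the third component, which is the most confident but carries little gating weight, and selects the third category.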

We have skipped over a problem: how many component classifiers should be used? 
Of course, if we have prior information about the number of component processes that 
generated the mixture density, this should guide our choice of k. In the absence of 
such information, we may have to explore different values of k, thereby tailoring the 
bias and variance of the full ensemble classifier. Typically, if the true number of
components in the mixture density is k*, a mixture-of-experts with more than k* component
classifiers will generalize better than one with fewer than k* component classifiers,
because the extra component classifiers learn to duplicate one another and hence become
redundant.


9.7.2 Component classifiers without discriminant functions 


Occasionally we seek to form an ensemble classifier from highly trained component 
classifiers, some of which might not themselves compute discriminant functions. For 
instance, we might have four component classifiers — a k-nearest-neighbor classifier, 
a decision tree, a neural network, and a rule-based system — all addressing the same 
problem. While a neural network would provide analog values for each of the c 
categories, the rule-based system would give only a single category label (i.e., a one- 
of-c representation) and the k-nearest neighbor classifier would give only rank order 
of the categories. 

In order to integrate the information from the component classifiers we must
convert their outputs into discriminant values obeying the constraint of Eq. 55 so
we can use the framework of Fig. 9.19. The simplest heuristics to this end are the
following:


Analog If the outputs of a component classifier are analog values $\tilde{g}_i$, we can use the
softmax transformation,


$$g_i = \frac{e^{\tilde{g}_i}}{\sum_{j=1}^{c} e^{\tilde{g}_j}}, \qquad (60)$$


to convert them to values $g_i$.


Rank order If the output is a rank order list, we assume the discriminant function 
is linearly proportional to the rank order of the item on the list. Of course, the 
resulting g; should then be properly normalized, and thus sum to 1.0. 


One-of-c If the output is a one-of-c representation, in which a single category is
identified, we let $g_j = 1$ for the j corresponding to the chosen category, and 0
otherwise.


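The table gives a simple illustration of these heuristics. In each pair of columns,
$\tilde{g}_i$ is the raw output of the component classifier and $g_i$ is the converted value.


   Analog value          Rank order              One-of-c
$\tilde{g}_i$    $g_i$      $\tilde{g}_i$    $g_i$              $\tilde{g}_i$    $g_i$
 0.4     0.158       3rd    4/21 = 0.190       0      0.0
 0.6     0.193       6th    1/21 = 0.048       1      1.0
 0.9     0.260       5th    2/21 = 0.095       0      0.0
 0.3     0.143       1st    6/21 = 0.286       0      0.0
 0.2     0.129       2nd    5/21 = 0.238       0      0.0
 0.1     0.117       4th    3/21 = 0.143       0      0.0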


Once the outputs of the component classifiers have been converted to effective 
discriminant functions in this way, the component classifiers are themselves held fixed, 
but the gating network is trained as described in Eq. 59. This method is particularly 
useful when several highly trained component classifiers are pooled to form a single 
decision. 
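
The three conversion heuristics can be written directly as small functions. The names are hypothetical, and the rank-order rule assumes, as in the table, that the item ranked first receives the largest value.

```python
import math

def from_analog(values):
    """Softmax transformation of analog outputs (Eq. 60)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def from_ranks(ranks):
    """Values proportional to reversed rank (1 = best), normalized to sum to one."""
    c = len(ranks)
    scores = [c + 1 - r for r in ranks]
    total = sum(scores)
    return [s / total for s in scores]

def from_one_of_c(chosen, c):
    """1 for the chosen category index (counting from 0), 0 for all others."""
    return [1.0 if j == chosen else 0.0 for j in range(c)]

analog = from_analog([0.4, 0.6, 0.9, 0.3, 0.2, 0.1])
ranked = from_ranks([3, 6, 5, 1, 2, 4])
one_hot = from_one_of_c(1, 6)
```

Each function returns values that sum to one, so the results satisfy the constraint of Eq. 55 and can be fed to the gating network unchanged.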
Summary


The No Free Lunch Theorem states that in the absence of prior information about 
the problem there are no reasons to prefer one learning algorithm or classifier model 
over another. Given that a finite set of feature values are used to distinguish the 
patterns under consideration, the Ugly Duckling Theorem states that the number of 
predicates shared by any two different patterns is constant, and does not depend upon 
the choice of the two objects. Together, these theorems highlight the need for insight 
into proper features and matching the algorithm to the data distribution — there 
is no problem independent “best” learning or pattern recognition system nor feature 
representation. In short, formal theory and algorithms taken alone are not enough; 
pattern classification is an empirical subject. 


Two ways to describe the match between classifier and problem are the bias and 
variance. The bias measures the accuracy or quality of the match (high bias implies 
a poor match) and the variance measures the precision or specificity of the match (a 
high variance implies a weak match). The bias-variance dilemma states that learning 
procedures with increased flexibility to adapt to the training data (e.g., have more 
free parameters) tend to have lower bias but higher variance. In classification there 
is a non-linear relationship between bias and variance, and low variance tends to be 
more important for classification than low bias. If classifier models can be expressed 
as binary strings, the minimum description length principle states that the best model 
is the one with the minimum sum of such a model description and the training data 
with respect to that model. This general principle can be extended to cover model- 
specific heuristics such as weight decay and pruning in neural networks, regularization 
in specific models, and so on. 
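
The trade-off between bias and variance in regression can be illustrated with a small simulation. The sketch below borrows the setting of the computer exercises (target F(x) = x^2, Gaussian noise of variance 0.1, data sets of size n = 10); the three models, the evaluation grid and all function names are assumptions made for illustration. It estimates the squared bias and the variance of each model, averaged over the grid.

```python
import random
import statistics

def target(x):
    return x * x

def make_dataset(rng, n, noise_sd):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [target(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_fixed(xs, ys):
    """No free parameters: always predict 0.5."""
    return lambda x: 0.5

def fit_constant(xs, ys):
    """One free parameter: the least-squares constant is the mean of y."""
    a0 = statistics.fmean(ys)
    return lambda x: a0

def fit_line(xs, ys):
    """Two free parameters: closed-form least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = my - a1 * mx
    return lambda x: a0 + a1 * x

def bias_variance(fit, n_sets=200, n=10, noise_sd=0.1 ** 0.5, seed=0):
    """Return (bias^2, variance), each averaged over a grid on [-1, 1]."""
    rng = random.Random(seed)
    grid = [-1.0 + 0.1 * i for i in range(21)]
    models = [fit(*make_dataset(rng, n, noise_sd)) for _ in range(n_sets)]
    bias2, var = 0.0, 0.0
    for x in grid:
        preds = [g(x) for g in models]
        mean_pred = statistics.fmean(preds)
        bias2 += (mean_pred - target(x)) ** 2
        var += statistics.pvariance(preds, mean_pred)
    return bias2 / len(grid), var / len(grid)

results = {
    name: bias_variance(fit)
    for name, fit in [("fixed", fit_fixed), ("constant", fit_constant), ("line", fit_line)]
}
```

The fixed model has no variance at all, since it ignores the data, while each added free parameter lets the fit follow the particular training set more closely and so raises the variance.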


The basic insight underlying resampling techniques — such as the bootstrap, jack- 
knife, boosting, and bagging — is that multiple data sets selected from a given data 
set enable the value and ranges of arbitrary statistics to be computed. In classification, 
boosting techniques such as AdaBoost adjust the match of the full classifier to the 
problem (and thus the bias and variance) even for an arbitrary basic classification 
method. In learning with queries, the classifier system presents query patterns to an 
oracle for labeling. Such learning is most efficient if informative patterns — ones for 
which the classifier is least certain — are presented as queries. 
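
As an illustration of the resampling idea, the following sketch computes jackknife and bootstrap estimates of the variance of an arbitrary statistic. The function names, the number of bootstrap replicates and the fixed seed are assumptions; the sample values are the ones used in the maximum likelihood model selection problems later in this chapter. For the sample mean, the jackknife estimate coincides with the traditional estimate s^2/n.

```python
import random
import statistics

def jackknife_variance(data, statistic):
    """Jackknife variance estimate from the n leave-one-out replicates."""
    n = len(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = statistics.fmean(loo)
    return (n - 1) / n * sum((v - loo_mean) ** 2 for v in loo)

def bootstrap_variance(data, statistic, n_boot=2000, seed=0):
    """Bootstrap variance estimate from n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.pvariance(reps)

data = [0.2, 0.5, 0.4, 0.3, 0.9, 0.7, 0.6]

jk_var_mean = jackknife_variance(data, statistics.fmean)
bs_var_mean = bootstrap_variance(data, statistics.fmean)
jk_var_median = jackknife_variance(data, statistics.median)
```

The same two functions apply unchanged to statistics such as the median, for which no simple closed-form variance estimate is available.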


There are a number of methods for estimating the final accuracy of classifiers and 
thus comparing them. Each is based on assumptions, for example that the parametric 
model is known, or that the form of its learning curve is known. Cross validation, 
jackknife and bootstrap methods are closely related techniques that use subsets of 
the training data to estimate classifier accuracy. Maximum likelihood (ML-II) and 
Bayesian methods — extensions of methods for setting parameters — can be used 
to compare and choose among models. A key term in Bayesian model selection is 
the Occam factor, which describes how the allowable volume in parameter space 
shrinks due to constraints imposed by the training data. The method penalizes “overly 
complex” models, where such complexity is a data-dependent property. 


There are a number of methods for combining the outputs of separate component 
or “expert” classifiers, such as linear weighting, winner-takes-all, and so on. Overall 
classification is generally better when the decision rules of the component classifiers 
differ and provide complementary information. 
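
Two of the pooling rules mentioned here can be sketched as follows. This is an illustrative simplification: the weights are fixed rather than produced by a trained gating network, "winner-takes-all" is taken to mean following the single component with the largest discriminant value, and the function names and example numbers are hypothetical.

```python
def linear_combination(outputs, weights):
    """Pool component discriminants by a weighted sum; return (category, pooled values)."""
    c = len(outputs[0])
    pooled = [sum(w * g[i] for w, g in zip(weights, outputs)) for i in range(c)]
    return max(range(c), key=pooled.__getitem__), pooled

def winner_take_all(outputs):
    """Follow the single component whose largest discriminant value is highest."""
    best = max(outputs, key=max)
    return best.index(max(best))

# Three component classifiers, three categories (illustrative values).
outputs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.45, 0.45],
]
weights = [1 / 3, 1 / 3, 1 / 3]

linear_choice, pooled = linear_combination(outputs, weights)
wta_choice = winner_take_all(outputs)
```

In this example the two rules disagree: the equally weighted sum favors the second category, while winner-takes-all follows the first component and selects the first category.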
Bibliographical and Historical Remarks 


The No Free Lunch Theorem appears in [109] as well as Wolpert’s collection of con- 
tributions on the foundations of the theory of generalization [108]. Schaffer’s “conser- 
vation law in generalization,” is a reformulation of one of the Parts of the Theorem, 
and was the inspiration for Fig. 9.1 [81]. The Ugly Duckling Theorem was proven in 
[104], which also explores some of its philosophical implications [77]. 

The foundational work on Kolmogorov complexity appears in [57, 58, 91, 92], but 
a short elementary overview [14] and Chaitin’s [15] and particularly Li and Vitányi's 
[LiVitanyi:97] books are far more accessible. Barron and Cover were the first to use 
a minimum description length (MDL) principle to estimate densities [7]. There are 
several versions of MDL [78, 79], such as the Akaike Information Criterion (AIC) [1, 2] 
and the Bayes Information Criterion (BIC) [84] (which differ from MDL by relative 
weighting of model penalty). Likewise, the Network Information Criterion (NIC) can 
be used to compare neural networks of the same architecture [71]. More generally, 
neural network pruning and general regularization methods can be cast as “minimum 
description” principles, but with different measures for model and fit of the data [65]. 

Convincing theoretical and philosophical justifications of Occam’s razor have been 
elusive. Karl Popper has argued that Occam’s razor is without operational value, 
since there is no clear criterion or measure of simplicity [74], a point echoed by other 
philosophers [90]. It is worth pointing out alternatives to Occam’s razor, which Isaac 
Newton cast in Principia as “Natura enim simplex est, et rerum causis superfluis 
non luxuriat,” or “for nature indeed is simple, and does not luxuriate in superfluous 
causes” [72]. The first alternative stems from Epicurus (342?-270?BC), who in a 
letter to Pythocles stated what we now call the principle of multiple explanations or 
principle of indifference: if several theories are consistent with the data, retain all 
such theories [29]. The second is a restatement of Bayes' approach: the probability of 
a model or hypothesis being true is proportional to the designer’s prior belief in the 
hypothesis multiplied by the conditional probability of the data given the hypothesis in 
question. Occam’s razor, or here favoring “simplicity” in classifiers, can be motivated 
by considering the cost (difficulty) of designing the classifier and the principle of 
bounded rationality — that we often settle for an adequate but not necessarily the 
optimal solution [87]. An empirical study showing that simple classifiers often work 
well can be found in [45]. 

The basic bias-variance decomposition and bias-variance dilemma [37] in regression 
appear in many statistics books [41, 16]. Geman et al. give a very clear presentation in 
the context of neural networks, but their discussion of classification is only indirectly 
related to their mathematical derivations for regression [35]. Our presentation for 
classification (zero-one loss) is based on Friedman's important paper [32]; the bias- 
variance decomposition has been explored in other non-quadratic cost functions as 
well [42]. 

Quenouille introduced the term jackknife in 1956 [76]. The theoretical founda- 
tions of resampling techniques are presented in Efron's clear book [28], and practical 
guides to their use include [36, 25]. Papers on bootstrap techniques for error estima- 
tion include [48]. Breiman has been particularly active in introducing and exploring 
resampling methods for estimation and classifier design, such as bagging [11] and gen- 
eral arcing [13]. AdaBoost [31] builds upon Schapire's analysis of the strength of weak 
learnability [82] and Freund’s early work in the theory of learning [30]. Boosting in 
multicategory problems is a bit more subtle than in the two-category problems we discussed 
[83]. Angluin's early work on queries for concept learning [3] was generalized to 
active learning by Cohn and many others [18, 20] and is fundamental to some efforts 
in collecting large databases [93, 95, 94, 99]. 

Cross validation was introduced by Cover [23], and has been used extensively in 
conjunction with classification methods such as neural networks. Estimates of error 
under different conditions include [34, 110, 103], and an excellent paper that derives 
the size of the test set needed for accurate estimation of classification accuracy is [39]. 
Bowyer and Phillips' book covers empirical evaluation techniques in computer vision 
[10], many of which apply to more general classification domains. 

The roots of maximum likelihood model selection stem from Bayes himself, but one 
of the earlier technical presentations is [38]. Interest in Bayesian model selection was 
revived in a series of papers by MacKay, whose primary interest was in applying the 
method to neural networks and interpolation [66, 69, 68, 67]. These model selection 
methods have subtle relationships to minimum description length (MDL) [78] and 
so-called maximum entropy approaches — topics that would take us a bit beyond our 
central concerns. Cortes and her colleagues pioneered the analysis of learning curves 
for estimating the final quality of a classifier [22, 21]. No rate of convergence results 
can be made in the arbitrary case for finding the Bayes error, however [6]. Hughes 
[46] first carried out the required computations and obtained the results shown in Fig. 9.14. 

Extensive books on techniques for combining general classifiers include [55, 56] and 
for combining neural nets in particular include [86, 9]. Perrone and Cooper described 
the benefits that arise when expert classifiers disagree [73]. Dasarathy’s book [24] has 
a nice mixture of theory (focusing more on sensor fusion than multiclassifier systems 
per se) and a collection of important original papers, including [43, 61, 96]. The simple 
heuristics for converting 1-of-c and rank order outputs to numerical values enabling 
integration were discussed in [63]. The hierarchical mixture of experts architecture and 
learning algorithm was first described in [51, 52]. A specific hierarchical multiclassifier 
technique is stacked generalization [107, 88, 89, 12], where for instance Gaussian kernel 
estimates at one level are pooled by yet other Gaussian kernels at a higher level. 

We have skipped over a great deal of work from the formal field of computational 
learning theory. Such work is generally preoccupied with convergence properties, 
asymptotics, and computational complexity, and usually relies on simplified or general 
models. Anthony and Biggs’ short, clear and elegant book is an excellent introduction 
to the field [5]; broader texts include [49, 70, 53]. Perhaps the work from the field most 
useful for pattern recognition practitioners comes from weak learnability and boosting, 
mentioned above. The probably approximately correct (PAC) framework, introduced 
by Valiant [98], has been very influential in computational learning theory, but has had 
only minor influence on the development of practical pattern recognition systems. 
A somewhat broader formulation, Probably almost Bayes (PAB), is described in [4]. 
The work by Vapnik and Chervonenkis on structural risk minimization [102], and later 
Vapnik-Chervonenkis (VC) theory [100, 101], derives (among other things) expected 
error bounds; it too has proven influential to the theory community. Alas, the bounds 
derived are somewhat loose in practice [19, 106]. 


Problems 


Section 9.2 


1. One of the “conservation laws” for generalization states that the positive generalization 
performance of an algorithm in some learning situations must be offset 
by negative performance elsewhere. Consider a very simple learning algorithm that 
seems to contradict this law. For each test pattern, the prediction of the majority 
learning algorithm is merely the category most prevalent in the training data. 


(a) Show that, averaged over all two-category problems of a given number of features, 
the off-training set error is 0.5. 


(b) Repeat (a) but for the minority learning algorithm, which always predicts the 
category label of the category least prevalent in the training data. 


(c) Use your answers from (a) & (b) to illustrate Part 2 of the No Free Lunch 
Theorem (Theorem 9.1). 


2. Prove Part 1 of Theorem 9.1, i.e., that uniformly averaged over all target functions 
$F$, $\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$. Summarize and interpret this result in words. 
3. Prove Part 2 of Theorem 9.1, i.e., for any fixed training set $D$, uniformly averaged 
over $F$, $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$. Summarize and interpret this result in words. 
4. Prove Part 3 of Theorem 9.1, i.e., uniformly averaged over all priors $P(F)$, 
$\mathcal{E}_1(E|n) - \mathcal{E}_2(E|n) = 0$. Summarize and interpret this result in words. 
5. Prove Part 4 of Theorem 9.1, i.e., for any fixed training set $D$, uniformly averaged 
over $P(F)$, $\mathcal{E}_1(E|D) - \mathcal{E}_2(E|D) = 0$. Summarize and interpret this result in words. 
6. Suppose you call an algorithm better if it performs slightly better than average 
over most problems, but very poorly on a small number of problems. Explain why 
the NFL Theorem does not preclude the existence of algorithms “better” in this way. 
7. Show by simple counterexamples that the averaging in the different Parts of the 
No Free Lunch Theorem (Theorem 9.1) must be “uniformly.” For instance imagine 
that the sampling distribution is a Dirac delta distribution centered on a single tar- 
get function, and algorithm 1 guesses the target function exactly while algorithm 2 
disagrees with algorithm 1 on every prediction. 


(a) Part 1 
(b) Part 2 
(c) Part 3 
(d) Part 4 


8. State how the No Free Lunch theorems imply that you cannot use training data to 
distinguish between new problems for which you generalize well from those for which 
you generalize poorly. Argue by reductio ad absurdum: that if you could distinguish 
such problems, then the No Free Lunch Theorem would be violated. 


9. Prove the relation $\sum_{r=0}^{n} \binom{n}{r} = (1 + 1)^n = 2^n$ of Eq. 5 two ways: 


(a) State the polynomial expansion of $(x + y)^n$ as a summation of coefficients and 
powers of x and y. Then, make a simple substitution for x and y. 


(b) Prove the relation by induction. Let $K(n) = \sum_{r=0}^{n} \binom{n}{r}$. First confirm that the 
relation is valid for n = 1, i.e., that $K(1) = 2^1$. Now prove that $K(n+1) = 2K(n)$ 
for arbitrary n. 
10. Consider the number of different Venn diagrams for k binary features $f_1, \ldots, f_k$. 
(Figure 9.2 shows several of these configurations for the k = 3 case.) 


(a) How many functionally different Venn diagrams exist for the k = 2 case? Sketch 
all of them. For each case, state how many different regions exist, i.e., how many 
different patterns can be described. 


(b) Repeat part (a) for the k = 3 case. 
(c) How many functionally different Venn diagrams exist for the arbitrary k case? 


11. While the text outlined a proof of the Ugly Duckling Theorem (Theorem 9.2) 
this problem asks you to fill in some of the details and explain some of its implications. 


(a) The discussion in the text assumed the classification problem had no constraints, 
and thus could be described by the most general Venn diagram, in which all 
predicates of a given rank r were present. How do the derivations change, if at 
all, if we know that there are constraints provided by the problem, and thus not 
all predicates of a given rank are possible, as in Fig. 9.2 (b) & (c)? 


(b) Someone sees two cars, A and B, made by the same manufacturer in the same 
model year, both are four-door and have same engine type, but they differ solely 
in that one is red the other green. Car C is made by a different manufacturer, 
has a different engine, is two-door and is blue. Explain in as much detail as 
possible why, even in this seemingly clear case, there are in fact no prior 
reasons to view cars A and B as any “more similar” than cars B and C. 


12. Suppose we describe patterns by means of predicates of a particular rank r*. 
Show the Ugly Duckling Theorem (Theorem 9.2) applies to any single level r*, and 
thus for all predicates up to an arbitrary maximum level. 

13. Make some simple assumptions and state, using $O(\cdot)$ notation, the Kolmogorov 
complexity of the following binary strings: 


(a) $\underbrace{010110111011110\ldots}_{n}$ 


(b) $\underbrace{000\ldots00100\ldots000}_{n}$ 


(c) $e = 10.10110111111000010\ldots_2$ 
(d) $2e = 101.01101111110000101\ldots_2$ 


(e) The binary digits of $\pi$, but where every 100th digit is changed to the numeral 
1. 


(f) The binary digits of $\pi$, but where every nth digit is changed to the numeral 1. 


14. Recall the notation from our discussion of the No Free Lunch Theorem and of 
Kolmogorov complexity. Suppose we use a learning algorithm with uniform P(h|D). 
In that case K(h, D) = K(D) in Eq. 8. Explain and interpret this result. 

15. Consider two binary strings $x_1$ and $x_2$. Explain why the Kolmogorov complexity 
of the pair obeys $K(x_1, x_2) \le K(x_1) + K(x_2) + c$ for some positive constant c. 

16. Case where MDL is easier than imposing probabilities. xxx 


17. The Berry paradox is related to the famous liar’s paradox (“this statement is false”), 
as well as a number of paradoxes in set theory explored by Bertrand Russell and 
Kurt Gödel. The Berry paradox shows indirectly how the notion of Kolmogorov 
complexity can be difficult and subtle. Consider describing positive integers by means 
of sentences, for instance “the number of fingers on a human hand,” or “the number 
of primes less than a million.” Explain why the definition “the least number that 
cannot be defined in less than twenty words” is paradoxical, and informally how this 
relates to the difficulty in computing Kolmogorov complexity. 


Section 9.3 


18. Expand the left hand side of Eq. 11 to get the right hand side, which expresses 
the mean-square error as a sum of a bias$^2$ and variance. Can bias ever be negative? 
Can variance ever be negative? 

19. Fill in the steps leading to Eq. 18, i.e., 


$$\Pr[g(\mathbf{x}; D) \neq y] = |2F(\mathbf{x}) - 1| \, \Pr[g(\mathbf{x}; D) \neq y_B] + \Pr[y_B \neq y]$$ 


where the target function is $F(\mathbf{x})$, $g(\mathbf{x}; D)$ is the computed discriminant value, and $y_B$ is 
the Bayes discriminant value. 

20. Assume that the probability of obtaining a particular discriminant value for 
pattern x for a training algorithm trained with data D, denoted p(g(x; D)) is a 
Gaussian. Use this Gaussian assumption and Eq. 19 to derive Eq. 20. 

21. Derive the jackknife estimate of the bias in Eq. 29. 

22. Prove that in the limit of $B \to \infty$, the bootstrap estimate of the variance of the 
mean is the same as the standard estimate of the variance of the mean. 


Section 9.4 


23. Prove that Eq. 24 for the average of the leave one out means, $\mu_{(\cdot)}$, is equivalent 
to Eq. 22 for the sample mean, $\hat{\mu}$. 

24. We say that an estimator is consistent if it converges to the true value in the 
limit of infinite data. Prove that the standard mean of Eq. 22 is not consistent for 
the distribution $p(x) \sim \tan^{-1}(x - a)$ for any finite real constant a. 

25. Prove that the jackknife estimate of an arbitrary statistic $\theta$ given in Eq. 30 is 
unbiased for estimating the true bias. 

26. Verify that Eq. 26 for the jackknife estimate of the variance of the mean is 
formally equivalent to the variance implied by the traditional estimate given in Eq. 23. 
27. Consider n points in one dimension. Use $O(\cdot)$ notation to express the computational 
complexity associated with each of the following estimations. 


(a) The jackknife estimate of the mean. 

(b) The jackknife estimate of the median. 

(c) The jackknife estimate of the standard deviation. 
(d) The bootstrap estimate of the mean. 


(e) The bootstrap estimate of the median. 


(f) The bootstrap estimate of the standard deviation. 


Section 9.5 


28. What is the computational complexity of the full jackknife estimate of accuracy and 
variance for an unpruned nearest-neighbor classifier (Chap. ??)? 

29. In standard boosting applied to a two-category problem, we must create a data 
set that is “most informative” with respect to the current state of the classifier. Why 
does this imply that half of its patterns should be classified correctly, rather than none 
of them? In a c-category problem, what portion of the patterns should be misclassified 
in a “most informative” set? 

30. In active learning, learning can be speeded by creating patterns that are “infor- 
mative,” i.e., those for which the two largest discriminants are approximately equal. 
Consider the two-category case where for any point $\mathbf{x}$ in feature space, discriminant 
values $g_1$ and $g_2$ are returned by the classifier. Write pseudocode that takes two points, 
$\mathbf{x}_1$ classified as $\omega_1$ and $\mathbf{x}_2$ classified as $\omega_2$, and rapidly finds a new point $\mathbf{x}_3$ that 
is “near” the current decision boundary, and hence is “informative.” Assume only 
that the discriminant functions are monotonic along the line linking $\mathbf{x}_1$ and $\mathbf{x}_2$. 

31. Consider AdaBoost with an arbitrary number of component classifiers. 


(a) State clearly any assumptions you make, and derive Eq. 37 for the ensemble 
training error of the full boosted system. 


(b) Recall that the training error for a weak learner applied to a two-category problem 
can be written $E_k = 1/2 - G_k$ for some positive value $G_k$. The training 
error for the first component classifier is $E_1 = 0.25$. Suppose that $G_k = 0.05$ for 
all $k = 1$ to $k_{max}$. Plot the upper bound on the ensemble test error given by 
Eq. 37, such as shown in Fig. 9.7. 


(c) Suppose that $G_k$ decreases as a function of k. Specifically, repeat part (b) with 
the assumption $G_k = 0.05/k$ for $k = 1$ to $k_{max}$. 


Section 9.6 


32. The No Free Lunch Theorem implies that if all problems are equally likely, then 
cross validation must fail as often as it succeeds. Show this as follows: Consider algo- 
rithm 1 to be standard cross validation, and algorithm 2 to be anti-cross validation, 
which advocates choosing the model that does worst on a validation set. Argue that 
if cross validation were better than anti-cross validation overall, the No Free Lunch 
Theorem would be violated. 

33. Suppose we believe that the data for a pattern classification task from one 
category comes either from a uniform distribution $p(x) \sim U(x_l, x_u)$ or from a normal 
distribution, $p(x) \sim N(\mu, \sigma^2)$, but we have no reason to prefer one over the other. 
Our sample data is $D = \{0.2, 0.5, 0.4, 0.3, 0.9, 0.7, 0.6\}$. 


(a) Find the maximum likelihood values of $x_l$ and $x_u$ for the uniform model. 
(b) Find the maximum likelihood values of $\mu$ and $\sigma$ for the Gaussian model. 


(c) Use maximum likelihood model selection to decide which model should be pre- 
ferred. 


34. Suppose we believe that the data for a pattern classification task from one 
category comes either from a uniform distribution bounded below by 0, i.e., $p(x) \sim U(0, x_u)$, 
or from a normal distribution, $p(x) \sim N(\mu, \sigma^2)$, but we have no reason to 
prefer one over the other. Our sample data is $D = \{0.2, 0.5, 0.4, 0.3, 0.9, 0.7, 0.6\}$. 


(a) Find the maximum likelihood value of $x_u$ for the uniform model. 
(b) Find the maximum likelihood values of $\mu$ and $\sigma$ for the Gaussian model. 


(c) Use maximum likelihood model selection to decide which model should be pre- 
ferred. 


(d) State qualitatively the difference between your solution here and that to Prob- 
lem 33, without necessarily having to solve that problem. In particular, what 
are the implications from the fact that the two candidate models have different 
numbers of parameters? 


35. Consider three candidate one-dimensional distributions each parameterized by 
an unknown value for its “center”: 


• Gaussian: $p(x) \sim N(\mu, 1)$ 


• Triangle: $p(x) \sim T(\mu, 1) = \begin{cases} 1 - |x - \mu| & \text{for } |x - \mu| < 1 \\ 0 & \text{otherwise} \end{cases}$ 


• Uniform: $p(x) \sim U(\mu - 1, \mu + 1)$. 


We are given the data $D = \{-0.9, -0.1, 0, 0.1, 0.9\}$, and thus, clearly the maximum 
likelihood solution $\hat{\mu} = 0$ applies to each model. 


(a) Use maximum likelihood model selection to determine the best model for this 
data. State clearly any assumptions you make. 


(b) Suppose we are sure for each model that the center must lie in the range $-1 < \mu < 1$. 
Calculate the Occam factor for each model and the data given. 


(c) Use Bayesian model selection to determine the “best” model given D. 


36. Use Eq. 38 and generate curves of the form shown in Fig. 9.10. Prove analytically 
that the curves are symmetric with respect to the interchange $\hat{p} \to (1 - \hat{p})$ and 
$p \to (1 - p)$. Explain the reasons for this symmetry. 

37. Let model $h_i$ be described by a k-dimensional parameter vector $\boldsymbol{\theta}$. State your 
assumptions and show that the Occam factor can be written as 


$$p(\hat{\boldsymbol{\theta}}|h_i)\,(2\pi)^{k/2}\,|\mathbf{H}|^{-1/2},$$ 


as given in Eq. 44, where the Hessian $\mathbf{H}$ is a matrix of second-order derivatives defined 
in Eq. 45. 
38. Bertrand’s paradox shows how the notion of “uniformly distributed” models can 
be problematic, and leads us to question the principle of indifference (cf., Computer 


exercise 7). Consider the following problem: Given a circle, find the probability that a 
“randomly selected” chord has length greater than the side of an inscribed equilateral 
triangle. 

Here are three possible solutions to this problem and their justifications, illustrated 
in the figure: 
1. By definition, a chord strikes the circle at two points. We can arbitrarily rotate 
the figure so as to place one of those points at the top. The other point is equally 
likely to strike any other point on the circle. As shown in the figure at the left, 
one-third of such points (red) will yield a chord with length greater than that 
of the side of an inscribed equilateral triangle. Thus the probability a chord is 
longer than the length of the side of an inscribed equilateral triangle is P = 1/3. 


2. A chord is uniquely determined by the location of its midpoint. Any such mid- 
point that lies in the circular disk whose radius is half that of the full circle will 
yield a chord with length greater than that of the side of an inscribed equilateral triangle. 
Since the area of this red disk is one-fourth that of the full circle, the probability 
is P = 1/4. 


3. We can arbitrarily rotate the circle such that the midpoint of a chord lies on a 
vertical line. If the midpoint lies closer to the center than half the radius of the 
circle, the chord will be longer than the side of the inscribed equilateral triangle. 
Thus the probability is P = 1/2. 


[Figure: three panels illustrating the chord constructions, labeled P = 1/3, P = 1/4 and P = 1/2] 


Explain why there is little or no reason to prefer one of the solution methods over 
another, and thus the solution to the problem itself is ill-defined. Use your answer to 
reconcile Bayesian model selection with the No Free Lunch Theorem (Theorem 9.1). 
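
The three values quoted in Problem 38 can be reproduced by Monte Carlo simulation, which makes concrete that each corresponds to a different meaning of "randomly selected." The sketch below assumes a unit circle, so that the side of the inscribed equilateral triangle has length sqrt(3); the function names, the number of trials and the seed are illustrative choices.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of an equilateral triangle inscribed in a unit circle

def chord_from_endpoints(rng):
    """Solution 1: both endpoints chosen uniformly on the circumference."""
    delta = abs(rng.uniform(0.0, 2.0 * math.pi) - rng.uniform(0.0, 2.0 * math.pi))
    return 2.0 * math.sin(delta / 2.0)

def chord_from_disk_midpoint(rng):
    """Solution 2: midpoint chosen uniformly over the area of the disk."""
    d = math.sqrt(rng.random())        # distance from center of a uniform point in the disk
    return 2.0 * math.sqrt(1.0 - d * d)

def chord_from_radial_midpoint(rng):
    """Solution 3: midpoint chosen uniformly along a radius."""
    d = rng.random()
    return 2.0 * math.sqrt(1.0 - d * d)

def probability_longer(sampler, trials=100_000, seed=0):
    """Fraction of sampled chords longer than the side of the inscribed triangle."""
    rng = random.Random(seed)
    hits = sum(sampler(rng) > SIDE for _ in range(trials))
    return hits / trials

p_endpoints = probability_longer(chord_from_endpoints)
p_disk = probability_longer(chord_from_disk_midpoint)
p_radial = probability_longer(chord_from_radial_midpoint)
```

Each estimate settles near the value claimed by the corresponding solution, so the disagreement lies in the sampling procedures and not in any arithmetic error.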
39. If k of $n'$ independent, randomly chosen test patterns are misclassified, then as 
given in Eq. 38 k has a binomial distribution 

$$P(k) = \binom{n'}{k} p^k (1 - p)^{n' - k}.$$ 

Prove that the maximum likelihood estimate for p is then $\hat{p} = k/n'$, as given in Eq. 39. 

40. Derive the relation for f(n, d), the fraction of dichotomies of n randomly chosen 
points in d dimensions that are linearly separable, given by Eq. 53. Explain why 
$f(n, d) = 1$ for $n \le d + 1$. 

41. Write pseudocode for an algorithm to determine the large n’ limit of the test 
error given the assumption of a power-law decrease in error described by Eq. 52 and 
illustrated in Fig. 9.17. 

42. Suppose a standard three-layer neural network having J input units, H hid- 
den units, single bias unit and 2 output units is trained on a two-category problem 
(Chap. ??). What is the degeneracy of the final assignment of weights? That is, 
how many ways can the weights be relabeled with the decision rule being unchanged? 
Explain how this degeneracy would need to be incorporated into a Bayesian model 
selection. 


Section 9.7 
43. Let $\mathbf{x}^i$ and $\mathbf{y}^i$ denote input and output vectors, respectively, and r index component 
processes ($1 \le r \le k$) in a mixture model. Use Bayes theorem, 


$$P(r|\mathbf{y}^i, \mathbf{x}^i) = \frac{P(r|\mathbf{x}^i)\,P(\mathbf{y}^i|\mathbf{x}^i, r)}{\sum_{q=1}^{k} P(q|\mathbf{x}^i)\,P(\mathbf{y}^i|\mathbf{x}^i, q)},$$ 


to derive the derivatives Eqs. 58 & 59 used for gradient descent learning in the mixture 
of experts model. 


44. Suppose a mixture of experts classifier has k Gaussian component classifiers of 
arbitrary mean and covariance in d dimensions, $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Derive the specific learning 
rules for the parameters of each component classifier and for the gating subsystem, 
special cases of Eqs. 58 & 59. 


Computer exercises 


Several exercises will make use of the following three-dimensional data sampled from 
four categories, denoted $\omega_i$. 


                 ω1                  ω2                  ω3                  ω4
sample     x1    x2    x3  |   x1    x2    x3  |   x1    x2    x3  |   x1    x2    x3
   1      2.5   3.4   7.9  |  4.2   4.9  11.3  |  2.9  15.5   4.6  | 16.9  12.4   0.2
   2      4.3   4.4   7.1  | 11.7   5.3  10.5  |  3.6  13.9   9.8  | 12.1  16.8   2.1
   3      7.1   0.8   6.3  |  8.4  11.1   6.6  | 10.3   6.1  12.3  | 13.7  12.1   5.5
   4      1.4  -0.2   2.5  |  8.2  10.4   4.9  |  8.2   5.5   7.1  | 11.9  13.4   3.4
   5      3.9   4.3   3.4  |  5.3   7.7   8.8  | 13.3   4.7  11.7  | 14.5  15.5   2.8
   6      3.2   6.8   5.1  |  7.9   4.5   9.5  |  6.6   8.1  16.7  | 15.6  14.9   4.4
   7      7.3   6.5   7.1  | 10.7   6.9  10.9  | 12.2   5.1   5.9  | 16.2  12.3   3.2
   8     -0.7   3.1   8.1  |  9.6   9.7   7.3  | 15.6   3.3  10.7  | 12.2  16.3   3.2
   9      2.8   5.9   2.2  |  8.2  11.2   6.3  |  4.6  10.1  13.8  | 14.5  12.9  -0.9
  10      6.1   7.6   4.3  |  5.3  10.1   4.9  |  9.1   4.4   8.9  | 15.8  15.6   4.5


Section 9.2 


1. Consider the use of the minimum description length principle for the design of 
a binary decision tree classifier (Chap. ??). Each question at a node is of the form 
“is $x_i > \theta$?” (or alternatively “is $x_i < \theta$?”). Specify each such question with 7 bits: 
two bits specify the feature queried ($x_1$, $x_2$, or $x_3$), a single bit specifies whether the 
comparison is > or <, and four bits specify each $\theta$ as an integer $0 \le \theta < 16$. Assume 
the Kolmogorov complexity of the classifier is, up to an additive constant, the sum 
of the bits of all questions. Assume too that the Kolmogorov complexity of the data 
given the tree classifier is merely the entropy of the data at the leaves, also measured 
in bits. 


(a) Train your tree with the data from the four-category problem in the table above. 
Starting at the root, grow your tree a single node at a time, continuing until 
each node is as pure as possible. Plot as a function of the total number of nodes 
the Kolmogorov complexity of 1) the classifier, 2) the data with respect to the 
classifier, and 3) their sum (Eq. 8). Show the tree (including the questions at 
its nodes) having the minimum description length. 
(b) The minimum description length principle gives a principled method for comparing 
classifiers in which the resolution of parameters (e.g., weights, thresholds, 
etc.) can be altered. Repeat part (a) but using only three bits to specify each 
threshold $\theta$ in the nodes. 


(c) Assume that the additive constants for the Kolmogorov complexities of your 
above classifiers are equal. Which of all the classifiers has the minimum descrip- 
tion length? 


Section 9.3 


2. Illustrate the bias-variance decomposition and the bias-variance dilemma for 
regression through simulations. Let the target function be $F(x) = x^2$ with Gaussian 
noise of variance 0.1. First, randomly generate 100 data sets, each of size n = 10, 
by selecting a value of x uniformly in the range $-1 < x < 1$ and then applying $F(x)$ 
with noise. Train any free parameters $a_i$ (by minimum square error criterion) in each 
of the regression functions in parts (a) through (d), one data set at a time. Then make a 
histogram of the sum-square error of Eq. 11 (cf. Fig. 9.4). For each model use your 
results to estimate the bias and the variance. 


(a) $g(x) = 0.5$ 
(b) $g(x) = a_0$ 
(c) $g(x) = a_0 + a_1 x$ 
(d) $g(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ 


(e) Repeat parts (a) through (d) for 100 data sets of size n = 100. 


(f) Summarize all your above results, with special consideration of the bias-variance 
decomposition and dilemma, and the effect of the size of the data set. 


Section 9.4 


2. The trimmed mean of a distribution is merely the sample mean of the distribution 
from which some portion $\alpha$ (e.g., 0.1) of the highest and of the lowest points have 
been deleted. The trimmed mean is, of course, less sensitive to the presence of outliers 
than is the traditional sample mean. 


(a) Show how in the limit $\alpha \to 0.5$, the trimmed mean of a distribution is the 
median. 


(b) Let the data D be the $x_3$ values of the 10 patterns in category $\omega_2$ in the table 
above. Write a program to determine the jackknife estimate of the median of 
D, and the jackknife estimate of the variance of this estimate. 


(c) Repeat part (b) but for the $\alpha = 0.1$ trimmed mean and its variance. 
(d) Repeat part (b) but for the $\alpha = 0.2$ trimmed mean and its variance. 


(e) Repeat parts (b) through (d) but where D has an additional (“outlier”) point at $x_3 = 20$. 


(f) Interpret your results, with special attention to the sensitivity of the trimmed 
mean to outliers. 
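
The definition of the trimmed mean given in this exercise can be written down directly. The sketch below is only a starting point: the function name is hypothetical, the rule for rounding the number of deleted points down to an integer is an assumption, and the sample values are made up for illustration and are not taken from the table.

```python
import math
import statistics

def trimmed_mean(data, alpha):
    """Sample mean after deleting a fraction alpha of the lowest and of the highest points."""
    if not 0.0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    ordered = sorted(data)
    k = math.floor(alpha * len(ordered))   # points dropped from each end (assumed rounding)
    return statistics.fmean(ordered[k:len(ordered) - k])

# Made-up sample with one large outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain = trimmed_mean(sample, 0.0)      # ordinary sample mean
trimmed = trimmed_mean(sample, 0.1)    # drops the values 1 and 100
```

With the single outlier removed by trimming, the estimate moves from 14.5 to 5.5, which is the kind of sensitivity the exercise asks you to examine with the jackknife.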


Section 9.5 


3. Write a program to implement the AdaBoost procedure (Algorithm 1) with com- 
ponent classifiers whose linear discriminants are trained by the basic LMS algorithm 
(Algorithm ?? from Chap. ??). 


(a) Apply your system to the problem of discriminating the ten points in $\omega_1$ from 
the ten points in $\omega_2$ in the table above. Plot your training error as a function 
of the number of component classifiers. Be sure the graph extends to a $k_{max}$ 
sufficiently high that the training error vanishes. 


(b) Define a “super-category” consisting of all the patterns in w; and wə in the 
table, and another super-category for the w3 and w4 patterns. Repeat part (a) 
for discriminating these super-categories. 


(c) Compare and interpret your graphs in (a) & (b), paying particular attention to 
the relative difficulties of the classification problems. 


4. Explore the value of active learning in a two-dimensional two-category problem 


in which the priors are Gaussians, p(x) ~ N (m; Xi) with pw, = (29) us = (2); 


Xi =% = ee 0 and P(w1) = P(w2) = 0.5. Throughout this problem, restrict data 


to be in the domain $-10 < x_i < +10$, for $i = 1, 2$. 


(a) State by inspection the Bayes classifier. This will be the decision used by the 
oracle in part (c). 


(b) Generate a training set of 100 points, 50 labeled $\omega_1$ sampled according to 
$p(\mathbf{x}) \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and likewise 50 patterns according to 
$p(\mathbf{x}) \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. Train a nearest-neighbor classifier (Chap. ??) using 
your data, and plot the decision boundary in two dimensions. 


(c) Now assume there is an oracle, which can label any query pattern according to 
your answer in part (a), which we exploit through a particular form of active 
learning. To begin the learning, choose 10 points according to a uniform 
distribution in the domain $-10 < x_i < +10$, for $i = 1, 2$. Apply labels to these points 
according to the oracle to get $\mathcal{D}_1$ and $\mathcal{D}_2$, for each category. Now generate new 
query points as follows. Randomly choose a point from $\mathcal{D}_1$ and a point from 
$\mathcal{D}_2$; create a query point midway between these two points. Label the point 
according to the oracle and add it to the appropriate $\mathcal{D}_i$. Continue until the 
total number of labeled points is 100. Now create a nearest-neighbor classifier 
using all points, and plot the decision boundary in two dimensions. 


(d) Compare qualitatively your classifiers from parts (a), (b) & (c), and discuss your 
results. 


@ Section 9.6 


5. Explore a case where cross validation need not yield an improved classifier. 
Throughout, the classifier will be k-nearest-neighbor (Chap. ??), where k will be 
set by cross validation. Consider a two-category problem in two dimensions, with uniform 
prior distributions throughout the range $0 < x_i < 1$ for $i = 1, 2$. 


(a) First form a third, test set $\mathcal{D}_{test}$ of 20 points (10 points in $\omega_1$ and 10 in $\omega_2$) 
randomly chosen according to a uniform distribution. 


(b) Next generate 100 points, with 50 patterns in each category. Set $\gamma = 0.1$ and split 
this set into a training set $\mathcal{D}_{train}$ (90 points) and validation set $\mathcal{D}_{val}$ (10 points). 


(c) Now form a k-nearest-neighbor classifier in which k is increased until the first 
minimum of the validation error is found. (Restrict k to odd values, to avoid ties.) 
Now determine the error of your classifier using the test set. 


(d) Repeat part (c), but instead find the k that is the first maximum of the validation 
error. 


(e) Repeat parts (c) & (d) five times, noting the test error in all ten cases. 


(f) Discuss your results, in particular how they depend or do not depend upon the 
fact that the data were all uniformly distributed. 


6. Consider three candidate one-dimensional distributions each parameterized by an 
unknown value for its “center”: 


• Gaussian: $p(x) \sim N(\mu, \sigma^2)$ 

• Triangle: $p(x) \sim T(\mu, 1) = \begin{cases} 1 - |x - \mu| & \text{for } |x - \mu| < 1 \\ 0 & \text{otherwise} \end{cases}$ 

• Uniform: $p(x) \sim U(\mu - 2, \mu + 2)$. 


Suppose we are sure that for each model the center must lie in the range $-1 < \mu < 1$, 
and for the Gaussian that $0 < \sigma^2 < 1$. Suppose too that we are given the data 
$\mathcal{D} = \{-.9, -.1, 0., .1, .9\}$. Clearly, the maximum likelihood solution $\hat{\mu} = 0$ applies to 
each model. 


(a) Estimate the Occam factor in each case. 
(b) Use Bayesian model selection to choose the best of these models. 


7. Problem 38 describes Bertrand’s paradox, which involves the probability that a 
circle’s chord “randomly chosen” will have length greater than that of an inscribed 
equilateral triangle. 


(a) Write a program to generate chords according to the logic of solution (1) in 
Problem 38. Generate 1000 such chords and estimate empirically the probability 
that a chord has length greater than that of an inscribed equilateral triangle. 


(b) Repeat part (a) assuming the logic underlying solution (2). 
(c) Repeat part (a) assuming the logic underlying solution (3). 


(d) Explain why there is little or no reason to prefer one of the solution methods 
over another, and thus the solution to the stated problem is ill-defined. 


(e) Relate your answers above to the No Free Lunch Theorem (Theorem 9.1) and 
Bayesian model selection. 
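
A minimal sketch of the simulation is given below for a unit circle. Problem 38 is not reproduced here, so the three constructions are the ones commonly associated with Bertrand's paradox and are named by what they do; which of them corresponds to solution (1), (2) or (3) is an assumption the reader must check against Problem 38. The function names are illustrative.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of the equilateral triangle inscribed in a unit circle

def chord_random_endpoints(rng):
    """Chord joining two points chosen uniformly on the circumference."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * abs(math.sin((a - b) / 2.0))

def chord_random_radial_distance(rng):
    """Chord perpendicular to a radius, at a uniform distance from the center."""
    r = rng.uniform(0.0, 1.0)
    return 2.0 * math.sqrt(max(0.0, 1.0 - r * r))

def chord_random_midpoint(rng):
    """Chord whose midpoint is uniform over the area of the disc."""
    r = math.sqrt(rng.uniform(0.0, 1.0))   # sqrt makes the area density uniform
    return 2.0 * math.sqrt(max(0.0, 1.0 - r * r))

def estimate(method, n=1000, seed=0):
    """Fraction of n random chords that are longer than the triangle side."""
    rng = random.Random(seed)
    return sum(method(rng) > SIDE for _ in range(n)) / n

for method in (chord_random_endpoints, chord_random_radial_distance,
               chord_random_midpoint):
    print(method.__name__, estimate(method))
```

The three constructions have different limiting probabilities (1/3, 1/2 and 1/4 respectively), which is the point of part (d): "randomly chosen" does not define a distribution until a construction is specified.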




@ Section 9.7 


8. Create a multiclassifier system for the data in the table above. As in Computer 
exercise 3, define two super-categories, where the twenty points in $\omega_1$ and $\omega_2$ form 
one category, $\omega_A$, and the remaining twenty points form $\omega_B$. 


(a) Let the first component classifier be based on Gaussian priors, where the mean 
$\boldsymbol{\mu}_i$ is arbitrary and the covariance is estimated by maximum likelihood (Chap. ??). 
What is the training error measured using $\omega_A$ and $\omega_B$? 


(b) Let the second component classifier also be based on Gaussian priors, but where 
the covariance is arbitrary. 


(c) Train your two-component classifier by gradient descent (Eqs. 58 & 59). What 
is the training error of the full system? 


(d) Discuss your answers. 


Chapter 10 


Unsupervised Learning and 
Clustering 


10.1 Introduction 


Until now we have assumed that the training samples used to design a classifier were 
labeled by their category membership. Procedures that use labeled samples are 
said to be supervised. Now we shall investigate a number of unsupervised procedures, 
which use unlabeled samples. That is, we shall see what can be done when all one 
has is a collection of samples without being told their category. 

One might wonder why anyone is interested in such an unpromising problem, and 
whether or not it is possible even in principle to learn anything of value from un- 
labeled samples. There are at least five basic reasons for interest in unsupervised 
procedures. First, collecting and labeling a large set of sample patterns can be sur- 
prisingly costly. For instance, recorded speech is virtually free, but accurately labeling 
the speech — marking what word or phoneme is being uttered at each instant — 
can be very expensive and time consuming. If a classifier can be crudely designed on 
a small set of labeled samples, and then “tuned up” by allowing it to run without 
supervision on a large, unlabeled set, much time and trouble can be saved. Second, 
one might wish to proceed in the reverse direction: train with large amounts of (less 
expensive) unlabeled data, and only then use supervision to label the groupings found. 
This may be appropriate for large “data mining” applications where the contents of 
a large database are not known beforehand. Third, in many applications the charac- 
teristics of the patterns can change slowly with time, for example in automated food 
classification as the seasons change. If these changes can be tracked by a classifier 
running in an unsupervised mode, improved performance can be achieved. Fourth, 
we can use unsupervised methods to find features, that will then be useful for cate- 
gorization. There are unsupervised methods that represent a form of data-dependent 
“smart preprocessing” or “smart feature extraction.” Lastly, in the early stages of 
an investigation it may be valuable to gain some insight into the nature or structure 
of the data. The discovery of distinct subclasses or similarities among patterns or of 
major departures from expected characteristics may suggest we significantly alter our 
approach to designing the classifier. 

The answer to the question of whether or not it is possible in principle to learn 
anything from unlabeled data depends upon the assumptions one is willing to accept 
— theorems can not be proved without premises. We shall begin with the very restric- 
tive assumption that the functional forms for the underlying probability densities are 
known, and that the only thing that must be learned is the value of an unknown pa- 
rameter vector. Interestingly enough, the formal solution to this problem will turn out 
to be almost identical to the solution for the problem of supervised learning given in 
Chap. ??. Unfortunately, in the unsupervised case the solution suffers from the usual 
problems associated with parametric assumptions without providing any of the bene- 
fits of computational simplicity. This will lead us to various attempts to reformulate 
the problem as one of partitioning the data into subgroups or clusters. While some of 
the resulting clustering procedures have no known significant theoretical properties, 
they are still among the more useful tools for pattern recognition problems. 


10.2 Mixture Densities and Identifiability 


We begin by assuming that we know the complete probability structure for the prob- 
lem with the sole exception of the values of some parameters. To be more specific, we 
make the following assumptions: 


1. The samples come from a known number $c$ of classes. 
2. The prior probabilities $P(\omega_j)$ for each class are known, $j = 1, \ldots, c$. 


3. The forms for the class-conditional probability densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are known, 
$j = 1, \ldots, c$. 


4. The values for the $c$ parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c$ are unknown. 


5. The category labels are unknown. 


Samples are assumed to be obtained by selecting a state of nature $\omega_j$ with 
probability $P(\omega_j)$ and then selecting an $\mathbf{x}$ according to the probability law 
$p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$. Thus, the probability density function for the samples is given by 


$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j) P(\omega_j), \tag{1}$$

where $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)$. For obvious reasons, a density function of this form is called 
a mixture density. The conditional densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are called the component 
densities, and the prior probabilities $P(\omega_j)$ are called the mixing parameters. The 
mixing parameters can also be included among the unknown parameters, but for the 
moment we shall assume that only $\boldsymbol{\theta}$ is unknown. 
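
The two-stage sampling description and Eq. 1 can be sketched in a few lines. The example below assumes one-dimensional normal components with unit variance, so that each $\boldsymbol{\theta}_j$ is just a mean; that choice, the parameter values and the function names are illustrative assumptions, not part of the text.

```python
import math
import random

def normal_pdf(x, mean, var=1.0):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, thetas, priors):
    """Eq. 1: p(x|theta) = sum_j p(x|w_j, theta_j) P(w_j)."""
    return sum(P * normal_pdf(x, th) for th, P in zip(thetas, priors))

def sample_mixture(n, thetas, priors, rng):
    """Pick a state of nature w_j with probability P(w_j), then draw x from it."""
    samples = []
    for _ in range(n):
        j = rng.choices(range(len(priors)), weights=priors)[0]
        samples.append(rng.gauss(thetas[j], 1.0))   # the label j is discarded
    return samples

thetas, priors = [-2.0, 2.0], [1.0 / 3.0, 2.0 / 3.0]   # hypothetical values
xs = sample_mixture(5, thetas, priors, random.Random(0))
print(xs)
print(mixture_pdf(0.0, thetas, priors))
```

Discarding the label `j` after sampling is what makes the data unlabeled: only the values `xs` are available to the learner.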

Our basic goal will be to use samples drawn from this mixture density to estimate 
the unknown parameter vector $\boldsymbol{\theta}$. Once we know $\boldsymbol{\theta}$ we can decompose the mixture 
into its components and use a Bayesian classifier on the derived densities, if indeed 
classification is our final goal. Before seeking explicit solutions to this problem, 
however, let us ask whether or not it is possible in principle to recover $\boldsymbol{\theta}$ from the mixture. 
Suppose that we had an unlimited number of samples, and that we used one of the 
nonparametric methods of Chap. ?? to determine the value of $p(\mathbf{x}|\boldsymbol{\theta})$ for every $\mathbf{x}$. If 
there is only one value of $\boldsymbol{\theta}$ that will produce the observed values for $p(\mathbf{x}|\boldsymbol{\theta})$, then 
a solution is at least possible in principle. However, if several different values of $\boldsymbol{\theta}$ 
can produce the same values for $p(\mathbf{x}|\boldsymbol{\theta})$, then there is no hope of obtaining a unique 
solution. 

These considerations lead us to the following definition: a density $p(\mathbf{x}|\boldsymbol{\theta})$ is said 
to be identifiable if $\boldsymbol{\theta} \neq \boldsymbol{\theta}'$ implies that there exists an $\mathbf{x}$ such that $p(\mathbf{x}|\boldsymbol{\theta}) \neq p(\mathbf{x}|\boldsymbol{\theta}')$. 
Or put another way, a density $p(\mathbf{x}|\boldsymbol{\theta})$ is not identifiable if we cannot recover a unique 
$\boldsymbol{\theta}$, even from an infinite amount of data. In the discouraging situation where we 
cannot infer any of the individual parameters (i.e., components of $\boldsymbol{\theta}$), the density 
is completely unidentifiable.* Note that the identifiability of $\boldsymbol{\theta}$ is a property of the 
model, irrespective of any procedure we might use to determine its value. As one might 
expect, the study of unsupervised learning is greatly simplified if we restrict ourselves 
to identifiable mixtures. Fortunately, most mixtures of commonly encountered density 
functions are identifiable, as are most complex or high-dimensional density functions 
encountered in real-world problems. 

Mixtures of discrete distributions are not always so obliging. As a simple example 
consider the case where $x$ is binary and $P(x|\boldsymbol{\theta})$ is the mixture 


$$P(x|\boldsymbol{\theta}) = \frac{1}{2}\theta_1^x (1 - \theta_1)^{1-x} + \frac{1}{2}\theta_2^x (1 - \theta_2)^{1-x} = \begin{cases} \frac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\ 1 - \frac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0. \end{cases}$$


Suppose, for example, that we know for our data that $P(x = 1|\boldsymbol{\theta}) = 0.6$, and hence 
that $P(x = 0|\boldsymbol{\theta}) = 0.4$. Then we know the function $P(x|\boldsymbol{\theta})$, but we cannot determine 
$\boldsymbol{\theta}$, and hence cannot extract the component distributions. The most we can say is 
that $\theta_1 + \theta_2 = 1.2$. Thus, here we have a case in which the mixture distribution is 
completely unidentifiable, and hence a case for which unsupervised learning is impossible 
in principle. Related situations may permit us to determine one or some parameters, 
but not all (Problem 3). 
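
The lack of identifiability in this binary example is easy to check numerically. The short sketch below evaluates the mixture for several hypothetical parameter pairs that all satisfy $\theta_1 + \theta_2 = 1.2$; the function name and the particular pairs are illustrative.

```python
def binary_mixture(x, theta1, theta2):
    """P(x|theta) for the equal-weight mixture of two Bernoulli components."""
    def component(theta):
        return theta ** x * (1.0 - theta) ** (1 - x)
    return 0.5 * component(theta1) + 0.5 * component(theta2)

# Hypothetical parameter pairs, all with theta1 + theta2 = 1.2.
candidates = [(0.6, 0.6), (0.4, 0.8), (0.2, 1.0), (0.9, 0.3)]
for th1, th2 in candidates:
    print((th1, th2), binary_mixture(1, th1, th2), binary_mixture(0, th1, th2))
```

Every pair yields the same distribution over $x$ (up to floating-point rounding), so no amount of data drawn from it can distinguish among them.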

This kind of problem commonly occurs with discrete distributions. If there are 
too many components in the mixture, there may be more unknowns than independent 
equations, and identifiability can be a serious problem. For the continuous case, 
the problems are less severe, although certain minor difficulties can arise due to the 
possibility of special cases. Thus, while it can be shown that mixtures of normal 
densities are usually identifiable, the parameters in the simple mixture density 


$$p(x|\boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \theta_2)^2\right] \tag{2}$$


cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$, for then $\theta_1$ and $\theta_2$ can be interchanged 
without affecting $p(x|\boldsymbol{\theta})$. To avoid such irritations, we shall acknowledge that 
identifiability can be a problem, but shall henceforth assume that the mixture densities we 
are working with are identifiable. 


* Technically speaking, a distribution is not identifiable if we cannot determine the parameters 
without bias. We might guess their correct values, but such a guess would have to be biased in 
some way. 


10.3 Maximum-Likelihood Estimates 


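Suppose now that we are given a set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of $n$ unlabeled samples drawn 
independently from the mixture density 


$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j) P(\omega_j), \tag{1}$$


where the full parameter vector $\boldsymbol{\theta}$ is fixed but unknown. The likelihood of the observed 
samples is, by definition, the joint density 


$$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta}). \tag{3}$$


The maximum-likelihood estimate $\hat{\boldsymbol{\theta}}$ is that value of $\boldsymbol{\theta}$ that maximizes $p(\mathcal{D}|\boldsymbol{\theta})$. 

If we assume that $p(\mathcal{D}|\boldsymbol{\theta})$ is a differentiable function of $\boldsymbol{\theta}$, then we can derive some 
interesting necessary conditions for $\hat{\boldsymbol{\theta}}$. Let $l$ be the logarithm of the likelihood, and 
let $\nabla_{\boldsymbol{\theta}_i} l$ be the gradient of $l$ with respect to $\boldsymbol{\theta}_i$. Then 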


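$$l = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) \tag{4}$$

and 

$$\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k|\boldsymbol{\theta})} \nabla_{\boldsymbol{\theta}_i} \left[\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \boldsymbol{\theta}_j) P(\omega_j)\right]. \tag{5}$$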


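If we assume that the elements of $\boldsymbol{\theta}_i$ and $\boldsymbol{\theta}_j$ are functionally independent if $i \neq j$, and 
if we introduce the posterior probability 


$$P(\omega_i|\mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i) P(\omega_i)}{p(\mathbf{x}_k|\boldsymbol{\theta})}, \tag{6}$$


we see that the gradient of the log-likelihood can be written in the interesting form 


$$\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i). \tag{7}$$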
k=1 
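Since the gradient must vanish at the value of $\boldsymbol{\theta}_i$ that maximizes $l$, the 
maximum-likelihood estimate $\hat{\boldsymbol{\theta}}_i$ must satisfy the conditions 


$$\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) = 0, \quad i = 1, \ldots, c. \tag{8}$$


Among the solutions to these equations for $\hat{\boldsymbol{\theta}}_i$ we may find the maximum-likelihood 
solution. 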
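
Equation 7 can be checked numerically against a finite-difference derivative of Eq. 4. The sketch below assumes one-dimensional normal components with unit variance, so that $\boldsymbol{\theta}_i$ is the mean $\mu_i$ and $\nabla_{\mu_i} \ln p(x|\omega_i, \mu_i) = x - \mu_i$; the data, means and priors are hypothetical, and the function names are illustrative.

```python
import math

def normal_pdf(x, mean):
    """Unit-variance univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, means, priors):
    """Eq. 4: l = sum_k ln p(x_k|theta)."""
    return sum(math.log(sum(P * normal_pdf(x, m) for m, P in zip(means, priors)))
               for x in data)

def posterior(i, x, means, priors):
    """Eq. 6: P(w_i|x_k, theta)."""
    joint = [P * normal_pdf(x, m) for m, P in zip(means, priors)]
    return joint[i] / sum(joint)

def gradient(i, data, means, priors):
    """Eq. 7, using grad ln p(x|w_i, mu_i) = (x - mu_i) for unit variance."""
    return sum(posterior(i, x, means, priors) * (x - means[i]) for x in data)

def numeric_gradient(i, data, means, priors, h=1e-6):
    """Central finite difference of the log-likelihood in mu_i."""
    up = list(means); up[i] += h
    dn = list(means); dn[i] -= h
    return (log_likelihood(data, up, priors) - log_likelihood(data, dn, priors)) / (2 * h)

data = [-2.1, -1.4, 0.3, 1.7, 2.5]          # hypothetical samples
means, priors = [-1.0, 1.5], [0.4, 0.6]     # hypothetical parameters
for i in range(2):
    print(i, gradient(i, data, means, priors), numeric_gradient(i, data, means, priors))
```

The two columns agree to within the accuracy of the finite difference, which is a useful sanity check before using Eq. 8 as a stopping condition in an iterative scheme.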

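It is not hard to generalize these results to include the prior probabilities $P(\omega_i)$ 
among the unknown quantities. In this case the search for the maximum value of 
$p(\mathcal{D}|\boldsymbol{\theta})$ extends over $\boldsymbol{\theta}$ and $P(\omega_i)$, subject to the constraints 


$$P(\omega_i) \geq 0, \quad i = 1, \ldots, c \tag{9}$$

and 

$$\sum_{i=1}^{c} P(\omega_i) = 1. \tag{10}$$


Let $\hat{P}(\omega_i)$ be the maximum-likelihood estimate for $P(\omega_i)$, and let $\hat{\boldsymbol{\theta}}_i$ be the 
maximum-likelihood estimate for $\boldsymbol{\theta}_i$. It can be shown (Problem ??) that if the likelihood function 
is differentiable and if $\hat{P}(\omega_i) \neq 0$ for any $i$, then $\hat{P}(\omega_i)$ and $\hat{\boldsymbol{\theta}}_i$ must satisfy 


$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \tag{11}$$

and 

$$\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) = 0, \tag{12}$$

where 

$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\theta}}_j) \hat{P}(\omega_j)}. \tag{13}$$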


These equations have the following interpretation. Equation 11 states that the 
maximum-likelihood estimate of the probability of a category is the average over the 
entire data set of the estimate derived from each sample; each sample is weighted 
equally. Equation 13 is ultimately related to Bayes Theorem, but notice that in 
estimating the probability for class $\omega_i$, the numerator on the right-hand side depends 
on $\hat{\boldsymbol{\theta}}_i$ and not the full $\hat{\boldsymbol{\theta}}$ directly. While Eq. 12 is a bit subtle, we can understand 
it clearly in the trivial $n = 1$ case. Since $\hat{P} \neq 0$, this case states merely that the 
probability density is maximized as a function of $\boldsymbol{\theta}_i$, which is surely what is needed for the 
maximum-likelihood solution. 


10.4 Application to Normal Mixtures 


It is enlightening to see how these general results apply to the case where the 
component densities are multivariate normal, $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$. The following table 
illustrates a few of the different cases that can arise depending upon which parameters 
are known (×) and which are unknown (?): 


Case | $\boldsymbol{\mu}_i$ | $\boldsymbol{\Sigma}_i$ | $P(\omega_i)$ | $c$ 
1    | ?   | ×   | ×   | × 
2    | ?   | ?   | ?   | × 
3    | ?   | ?   | ?   | ? 


Case 1 is the simplest, and will be considered in detail because of its pedagogical 
value. Case 2 is more realistic, though somewhat more involved. Case 3 represents the 
problem we face on encountering a completely unknown set of data; unfortunately, it 
cannot be solved by maximum-likelihood methods. We shall postpone discussion of 
what can be done when the number of classes is unknown until Sect. ??. 


10.4.1 Case 1: Unknown Mean Vectors 


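If the only unknown quantities are the mean vectors $\boldsymbol{\mu}_i$, then of course $\boldsymbol{\theta}_i$ consists of 
the components of $\boldsymbol{\mu}_i$. Equation 8 can then be used to obtain necessary conditions 
on the maximum-likelihood estimate for $\boldsymbol{\mu}_i$. Since the log-likelihood of a single sample is 


$$\ln p(\mathbf{x}|\omega_i, \boldsymbol{\mu}_i) = -\ln\left[(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}\right] - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i), \tag{14}$$


its derivative is 


$$\nabla_{\boldsymbol{\mu}_i} \ln p(\mathbf{x}|\omega_i, \boldsymbol{\mu}_i) = \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i). \tag{15}$$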


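Thus according to Eq. 8, the maximum-likelihood estimate $\hat{\boldsymbol{\mu}}_i$ must satisfy 


$$\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}) \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i) = 0, \quad \text{where } \hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_c). \tag{16}$$

After multiplying by $\boldsymbol{\Sigma}_i$ and rearranging terms, we obtain the solution: 


$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}) \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})}. \tag{17}$$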

This equation is intuitively very satisfying. It shows that the maximum-likelihood 
estimate for $\boldsymbol{\mu}_i$ is merely a weighted average of the samples; the weight for the $k$th 
sample is an estimate of how likely it is that $\mathbf{x}_k$ belongs to the $i$th class. If $P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})$ 
happened to be 1.0 for some of the samples and 0.0 for the rest, then $\hat{\boldsymbol{\mu}}_i$ would be the 
mean of those samples estimated to belong to the $i$th class. More generally, suppose 
that $\hat{\boldsymbol{\mu}}_i$ is sufficiently close to the true value of $\boldsymbol{\mu}_i$ that $P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})$ is essentially 
the true posterior probability for $\omega_i$. If we think of $P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})$ as the fraction of 
those samples having value $\mathbf{x}_k$ that come from the $i$th class, then we see that Eq. 17 
essentially gives $\hat{\boldsymbol{\mu}}_i$ as the average of the samples coming from the $i$th class. 

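Unfortunately, Eq. 17 does not give $\hat{\boldsymbol{\mu}}_i$ explicitly, and if we substitute 


$$P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\mu}}_i) P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\mu}}_j) P(\omega_j)}$$


with $p(\mathbf{x}|\omega_i, \hat{\boldsymbol{\mu}}_i) \sim N(\hat{\boldsymbol{\mu}}_i, \boldsymbol{\Sigma}_i)$, we obtain a tangled snarl of coupled simultaneous 
nonlinear equations. These equations usually do not have a unique solution, and we 
must test the solutions we get to find the one that actually maximizes the likelihood. 
If we have some way of obtaining fairly good initial estimates $\hat{\boldsymbol{\mu}}_i(0)$ for the unknown 
means, Eq. 17 suggests the following iterative scheme for improving the estimates: 


$$\hat{\boldsymbol{\mu}}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}(j)) \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))}. \tag{18}$$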

This is basically a gradient ascent or hill-climbing procedure for maximizing the 
log-likelihood function. If the overlap between component densities is small, then the 
coupling between classes will be small and convergence will be fast. However, when 
convergence does occur, all that we can be sure of is that the gradient is zero. Like all 
hill-climbing procedures, this one carries no guarantee of yielding the global maximum 
(Computer exercise 19). Note too that if the model is mis-specified (for instance we 
assume the "wrong" number of clusters) then the log-likelihood can actually decrease 
(Computer exercise 21). 


Example 1: Mixtures of two 1D Gaussians 


To illustrate the kind of behavior that can occur, consider the simple two-component 
one-dimensional normal mixture: 


$$p(x|\mu_1, \mu_2) = \underbrace{\frac{1}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \mu_1)^2\right]}_{\omega_1} + \underbrace{\frac{2}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \mu_2)^2\right]}_{\omega_2},$$


where $\omega_i$ denotes a Gaussian component. The 25 samples shown in the table were 
drawn sequentially from this mixture with $\mu_1 = -2$ and $\mu_2 = 2$. Let us use these 
samples to compute the log-likelihood function 


$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k|\mu_1, \mu_2)$$


for various values of $\mu_1$ and $\mu_2$. The bottom figure shows how $l$ varies with $\mu_1$ and $\mu_2$. 
The maximum value of $l$ occurs at $\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.668$, which is in the rough 
vicinity of the true values $\mu_1 = -2$ and $\mu_2 = 2$. However, $l$ reaches another peak of 
comparable height at $\hat{\mu}_1 = 2.085$ and $\hat{\mu}_2 = -1.257$. Roughly speaking, this solution 
corresponds to interchanging $\mu_1$ and $\mu_2$. Note that had the prior probabilities been 
equal, interchanging $\mu_1$ and $\mu_2$ would have produced no change in the log-likelihood 
function. Thus, as we mentioned before, when the mixture density is not identifiable, 
the maximum-likelihood solution is not unique. 


 k |  x_k   | ω1 | ω2 ||  k |  x_k   | ω1 | ω2 ||  k |  x_k   | ω1 | ω2 
 1 |  0.608 |    | x  ||  9 |  0.262 |    | x  || 17 | -3.458 | x  |    
 2 | -1.590 | x  |    || 10 |  1.072 |    | x  || 18 |  0.257 |    | x  
 3 |  0.235 |    | x  || 11 | -1.773 | x  |    || 19 |  2.569 |    | x  
 4 |  3.949 |    | x  || 12 |  0.537 |    | x  || 20 |  1.415 |    | x  
 5 | -2.249 | x  |    || 13 |  3.240 |    | x  || 21 |  1.410 |    | x  
 6 |  2.704 |    | x  || 14 |  2.400 |    | x  || 22 | -2.653 | x  |    
 7 | -2.473 | x  |    || 15 | -2.499 | x  |    || 23 |  1.396 |    | x  
 8 |  0.672 |    | x  || 16 |  2.608 |    | x  || 24 |  3.286 |    | x  
   |        |    |    ||    |        |    |    || 25 | -0.712 | x  |    


Additional insight into the nature of these multiple solutions can be obtained by 
examining the resulting estimates for the mixture density. The figure at the top 
shows the true (source) mixture density and the estimates obtained by using the two 
maximum-likelihood estimates as if they were the true parameter values. The 25 
sample values are shown as a scatter of points along the abscissa, with $\omega_1$ points in 
black and $\omega_2$ points in red. Note that the peaks of both the true mixture density and 
the maximum-likelihood solutions are located so as to encompass two major groups 
of data points. The estimate corresponding to the smaller local maximum of the 
log-likelihood function has a mirror-image shape, but its peaks also encompass reasonable 
groups of data points. To the eye, neither of these solutions is clearly superior, and 
both are interesting. 


(Above) The source mixture density used to generate sample data, and two 
maximum-likelihood estimates based on the data in the table. (Bottom) Log-likelihood of a 
mixture model consisting of two univariate Gaussians as a function of their means, 
for the data in the table. Trajectories for the iterative maximum-likelihood estimation 
of the means of a two-Gaussian mixture model based on the data are shown as red 
lines. Two local optima (with log-likelihoods -52.2 and -56.7) correspond to the two 
density estimates shown above. 
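
The iteration of Eq. 18 for this example can be sketched as follows, using the 25 samples from the table, the priors 1/3 and 2/3, and unit variances as in the mixture above. The starting point, the number of steps and the function names are illustrative assumptions; the log-likelihood is recorded at every step so its behavior can be inspected.

```python
import math

# The 25 samples x_k of Example 1, in order k = 1, ..., 25.
X = [0.608, -1.590, 0.235, 3.949, -2.249, 2.704, -2.473, 0.672, 0.262, 1.072,
     -1.773, 0.537, 3.240, 2.400, -2.499, 2.608, -3.458, 0.257, 2.569, 1.415,
     1.410, -2.653, 1.396, 3.286, -0.712]
PRIORS = (1.0 / 3.0, 2.0 / 3.0)

def joint(x, means):
    """P(w_i) p(x|w_i, mu_i) for the two unit-variance normal components."""
    return [P * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)
            for P, m in zip(PRIORS, means)]

def log_likelihood(means):
    return sum(math.log(sum(joint(x, means))) for x in X)

def update(means):
    """One application of Eq. 18: posterior-weighted averages of the samples."""
    num, den = [0.0, 0.0], [0.0, 0.0]
    for x in X:
        j = joint(x, means)
        total = sum(j)
        for i in range(2):
            w = j[i] / total          # P(w_i|x_k, current means)
            num[i] += w * x
            den[i] += w
    return [num[i] / den[i] for i in range(2)]

def iterate(means, steps=50):
    history = [log_likelihood(means)]
    for _ in range(steps):
        means = update(means)
        history.append(log_likelihood(means))
    return means, history

means, history = iterate([-1.0, 1.0])     # hypothetical starting point
print(means, history[-1])
print(update([0.5, 0.5]))                 # equal starting means
```

Starting from equal means, `update` returns the overall sample mean for both components, which is the symmetric fixed point discussed in the text; any asymmetric start is needed to move away from it.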


If Eq. 18 is used to determine solutions to Eq. 17 iteratively, the results depend 
on the starting values $\hat{\mu}_1(0)$ and $\hat{\mu}_2(0)$. The bottom figure shows trajectories from 
two different starting points. Although not shown, if $\hat{\mu}_1(0) = \hat{\mu}_2(0)$, convergence 
to a saddle point occurs in one step. This is not a coincidence; it happens for the 
simple reason that for this starting point $P(\omega_i|x_k, \hat{\mu}_1(0), \hat{\mu}_2(0)) = P(\omega_i)$. In such a 
case Eq. 18 yields the mean of all of the samples for $\hat{\mu}_1$ and $\hat{\mu}_2$ for all successive 
iterations. Clearly, this is a general phenomenon, and such saddle-point solutions can 
be expected if the starting point does not bias the search away from a symmetric 
answer. 


10.4.2 Case 2: All Parameters Unknown 


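If $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, and $P(\omega_i)$ are all unknown, and if no constraints are placed on the covariance 
matrix, then the maximum-likelihood principle yields useless singular solutions. The 
reason for this can be appreciated from the following simple example in one dimension. 
Let $p(x|\mu, \sigma^2)$ be the two-component normal mixture: 


$$p(x|\mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x^2\right].$$


The likelihood function for $n$ samples drawn from this probability density is merely 
the product of the $n$ densities $p(x_k|\mu, \sigma^2)$. Suppose that we let $\mu = x_1$, the value of 
the first sample. In this situation the density is 


$$p(x_1|\mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_1^2\right].$$


Clearly, for the rest of the samples 


$$p(x_k|\mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_k^2\right],$$


so that 


$$p(x_1, \ldots, x_n|\mu, \sigma^2) \geq \left\{\frac{1}{\sigma} + \exp\left[-\frac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right].$$


Thus, the first term at the right shows that by letting $\sigma$ approach zero we can make 
the likelihood arbitrarily large, and the maximum-likelihood solution is singular. 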
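
The unbounded growth of the likelihood is easy to observe numerically. The sketch below evaluates the log-likelihood of the two-component mixture above with $\mu$ pinned to the first sample while $\sigma$ shrinks; the five data values are hypothetical and the function names are illustrative.

```python
import math

def mixture(x, mu, sigma):
    """p(x|mu, sigma^2) = 0.5 N(mu, sigma^2) + 0.5 N(0, 1)."""
    narrow = (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
              / (math.sqrt(2.0 * math.pi) * sigma))
    fixed = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return 0.5 * narrow + 0.5 * fixed

def log_likelihood(data, mu, sigma):
    return sum(math.log(mixture(x, mu, sigma)) for x in data)

data = [0.8, -1.2, 0.3, 2.1, -0.5]     # hypothetical samples
for sigma in (1.0, 1e-1, 1e-2, 1e-4, 1e-8):
    print(sigma, log_likelihood(data, mu=data[0], sigma=sigma))
```

Once $\sigma$ is small compared with the spacing of the samples, each tenfold reduction adds roughly $\ln 10$ to the log-likelihood, with no upper limit, while the fit to every sample other than $x_1$ stops changing.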

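Ordinarily, singular solutions are of no interest, and we are forced to conclude that 
the maximum-likelihood principle fails for this class of normal mixtures. However, it 
is an empirical fact that meaningful solutions can still be obtained if we restrict our 
attention to the largest of the finite local maxima of the likelihood function. Assuming 
that the likelihood function is well behaved at such maxima, we can use Eqs. 11-13 
to obtain estimates for $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, and $P(\omega_i)$. When we include the elements of $\boldsymbol{\Sigma}_i$ 
in the elements of the parameter vector $\boldsymbol{\theta}_i$, we must remember that only half of the 
off-diagonal elements are independent. In addition, it turns out to be much more 
convenient to let the independent elements of $\boldsymbol{\Sigma}_i^{-1}$ rather than $\boldsymbol{\Sigma}_i$ be the unknown 
parameters. With these observations, the actual differentiation of 


$$\ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i) = \ln \frac{|\boldsymbol{\Sigma}_i^{-1}|^{1/2}}{(2\pi)^{d/2}} - \frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_i)$$

with respect to the elements of $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i^{-1}$ is relatively routine. Let $x_p(k)$ be the $p$th 
element of $\mathbf{x}_k$, $\mu_p(i)$ be the $p$th element of $\boldsymbol{\mu}_i$, $\sigma_{pq}(i)$ be the $pq$th element of $\boldsymbol{\Sigma}_i$, and 
$\sigma^{pq}(i)$ be the $pq$th element of $\boldsymbol{\Sigma}_i^{-1}$. Then differentiation gives 


$$\nabla_{\boldsymbol{\mu}_i} \ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i) = \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_k - \boldsymbol{\mu}_i)$$


and 


$$\frac{\partial \ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i)}{\partial \sigma^{pq}(i)} = \left(1 - \frac{\delta_{pq}}{2}\right)\left[\sigma_{pq}(i) - (x_p(k) - \mu_p(i))(x_q(k) - \mu_q(i))\right],$$


where $\delta_{pq}$ is the Kronecker delta. We substitute these results in Eq. 12 and perform a 
small amount of algebraic manipulation (Problem 16) and thereby obtain the following 
equations for the local-maximum-likelihood estimate $\hat{\boldsymbol{\mu}}_i$, $\hat{\boldsymbol{\Sigma}}_i$, and $\hat{P}(\omega_i)$: 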


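$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \tag{19}$$

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})} \tag{20}$$

$$\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})} \tag{21}$$


where 


$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\theta}}_j) \hat{P}(\omega_j)} = \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2} \exp\left[-\frac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2} \exp\left[-\frac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)^t \hat{\boldsymbol{\Sigma}}_j^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)\right] \hat{P}(\omega_j)}. \tag{22}$$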


While the notation may make these equations appear to be rather formidable, 
their interpretation is actually quite simple. In the extreme case where $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ is 
1.0 when $\mathbf{x}_k$ is from Class $\omega_i$ and 0.0 otherwise, $\hat{P}(\omega_i)$ is the fraction of samples from 
$\omega_i$, $\hat{\boldsymbol{\mu}}_i$ is the mean of those samples, and $\hat{\boldsymbol{\Sigma}}_i$ is the corresponding sample covariance 
matrix. More generally, $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ is between 0.0 and 1.0, and all of the samples 
play some role in the estimates. However, the estimates are basically still frequency 
ratios, sample means, and sample covariance matrices. 

The problems involved in solving these implicit equations are similar to the 
problems discussed in Sect. ??, with the additional complication of having to avoid singular 
solutions. Of the various techniques that can be used to obtain a solution, the most 
obvious approach is to use initial estimates to evaluate Eq. 22 for $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ and then 
to use Eqs. 19-21 to update these estimates. If the initial estimates are very good, 
having perhaps been obtained from a fairly large set of labeled samples, convergence 
can be quite rapid. However, the results do depend upon the starting point, and the 
problem of multiple solutions is always present. Furthermore, the repeated 
computation and inversion of the sample covariance matrices can be quite time consuming. 
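
One pass of this scheme can be sketched in the one-dimensional case, where each covariance matrix reduces to a variance and no matrix inversion is needed. This is a simplified illustration under that assumption, not the general multivariate procedure; the variance update uses the freshly computed mean, and the function name and test data are hypothetical.

```python
import math

def em_step(data, priors, means, variances):
    """One pass of Eqs. 19-22 for one-dimensional normal components."""
    c, n = len(priors), len(data)
    # Eq. 22: posteriors from the current estimates (the common factor
    # 1/sqrt(2*pi) cancels between numerator and denominator).
    post = []
    for x in data:
        joint = [P * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(v)
                 for P, m, v in zip(priors, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    new_priors, new_means, new_vars = [], [], []
    for i in range(c):
        weight = sum(p[i] for p in post)
        mean = sum(p[i] * x for p, x in zip(post, data)) / weight        # Eq. 20
        var = sum(p[i] * (x - mean) ** 2
                  for p, x in zip(post, data)) / weight                  # Eq. 21
        new_priors.append(weight / n)                                    # Eq. 19
        new_means.append(mean)
        new_vars.append(var)
    return new_priors, new_means, new_vars

# Hypothetical, well-separated samples and rough initial estimates.
samples = [-10.1, -9.9, 9.8, 10.2]
print(em_step(samples, [0.5, 0.5], [-5.0, 5.0], [1.0, 1.0]))
```

With well-separated groups the posteriors are essentially 0 or 1, so a single pass already returns the group fractions, group means and group variances, matching the interpretation given above. A practical version would also guard against a variance collapsing toward zero, which is the singular solution discussed earlier.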

Considerable simplification can be obtained if it is possible to assume that the 
covariance matrices are diagonal. This has the added virtue of reducing the number 
of unknown parameters, which is very important when the number of samples is 
not large. If this assumption is too strong, it still may be possible to obtain some 
simplification by assuming that the c covariance matrices are equal, which also may 
eliminate the problem of singular solutions (Problem 16). 


10.4.3 K-means clustering 


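Of the various techniques that can be used to simplify the computation and 
accelerate convergence, we shall briefly consider one elementary, approximate method. From 
Eq. 22, it is clear that the probability $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ is large when the squared 
Mahalanobis distance $(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)$ is small. Suppose that we merely compute 
the squared Euclidean distance $\|\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i\|^2$, find the mean $\hat{\boldsymbol{\mu}}_m$ nearest to $\mathbf{x}_k$, and 
approximate $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ as 


$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \simeq \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise.} \end{cases} \tag{23}$$


Then the iterative application of Eq. 20 leads to the following procedure for finding 
$\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_c$. (Although the algorithm is historically referred to as k-means clustering, 
we retain the notation $c$, our symbol for the number of clusters.) 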

Algorithm 1 (K-means clustering) 


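1 begin initialize $n$, $c$, $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$ 
2     do classify $n$ samples according to nearest $\boldsymbol{\mu}_i$ 
3        recompute $\boldsymbol{\mu}_i$ 
4     until no change in $\boldsymbol{\mu}_i$ 
5     return $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$ 
6 end 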


The computational complexity of the algorithm is $O(ndcT)$ where $d$ is the number of 
features and $T$ is the number of iterations (Problem 15). In practice, the number of 
iterations is generally much less than the number of samples. 
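
Algorithm 1 translates almost line for line into code. The sketch below is one possible rendering for samples given as tuples; the handling of empty clusters (keeping the old mean), the iteration cap and the function name are assumptions added for robustness and are not part of the algorithm as stated. It is run here on the 25 one-dimensional samples of Example 1 from a hypothetical starting point.

```python
def k_means(samples, means, max_iter=100):
    """Algorithm 1 for samples given as tuples; `means` are the initial centers."""
    means = [tuple(m) for m in means]
    for _ in range(max_iter):
        # Line 2: classify each sample according to its nearest mean.
        clusters = [[] for _ in means]
        for x in samples:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in means]
            clusters[d2.index(min(d2))].append(x)
        # Line 3: recompute each mean (an empty cluster keeps its old mean).
        new_means = [
            tuple(sum(col) / len(pts) for col in zip(*pts)) if pts else m
            for pts, m in zip(clusters, means)
        ]
        if new_means == means:      # Line 4: no change in the means
            break
        means = new_means
    return means

# The 25 samples of Example 1, wrapped as 1-tuples.
X = [0.608, -1.590, 0.235, 3.949, -2.249, 2.704, -2.473, 0.672, 0.262, 1.072,
     -1.773, 0.537, 3.240, 2.400, -2.499, 2.608, -3.458, 0.257, 2.569, 1.415,
     1.410, -2.653, 1.396, 3.286, -0.712]
samples = [(x,) for x in X]
print(k_means(samples, [(-1.0,), (1.0,)]))
```

Each pass computes $c$ distances in $d$ dimensions for each of the $n$ samples, which is where the $O(ndcT)$ cost comes from.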

This is typical of a class of procedures that are known as clustering procedures or 
algorithms. Later on we shall place it in the class of iterative optimization procedures, 
since the means tend to move so as to minimize a squared-error criterion function. For 
the moment we view it merely as an approximate way to obtain maximum-likelihood 
estimates for the means. The values obtained can be accepted as the answer, or can 
be used as starting points for the more exact computations. 

It is interesting to see how this procedure behaves on the example data we saw 
in Example 1. Figure 10.1 shows the sequence of values for $\hat{\mu}_1$ and $\hat{\mu}_2$ obtained for 
several different starting points. Since interchanging $\hat{\mu}_1$ and $\hat{\mu}_2$ merely interchanges 
the labels assigned to the data, the trajectories are symmetric about the line $\hat{\mu}_1 = \hat{\mu}_2$. 
The trajectories lead either to the point $\hat{\mu}_1 = -2.176$, $\hat{\mu}_2 = 1.684$ or to its symmetric 
image. This is close to the solution found by the maximum-likelihood method (viz., 
$\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.688$), and the trajectories show a general resemblance to 
those shown in Example 1. In general, when the overlap between the component 
densities is small the maximum-likelihood approach and the k-means procedure can 
be expected to give similar results. 

Figure 10.1: The k-means clustering procedure is a form of stochastic hill climbing 
in the log-likelihood function. The contours represent equal log-likelihood values for 
the one-dimensional data in Example 1. The dots indicate parameter values after 
different iterations of the k-means algorithm. Six of the starting points shown lead to 
local maxima, whereas two (i.e., $\mu_1(0) = \mu_2(0)$) lead to a saddle point near $\mu = 0$. 

Figure 10.2 shows a two-dimensional example, with the assumption of c = 3 clus- 
ters. The three initial cluster centers, chosen randomly from the training points, and 
their associated Voronoi tesselation, are shown in pink. According to the algorithm, 
the points in each of the three Voronoi cells are used to calculate new cluster centers 
(dark pink), and so on. Here, after the third iteration the algorithm has converged 
(red). Because the k-means algorithm is very simple and works well in practice, it is 
a staple of clustering methods. 


10.4.4 *Fuzzy k-means clustering 


In every iteration of the classical k-means procedure, each data point is assumed to 
be in exactly one cluster, as implied by Eq. 23 and by lines 2 & 3 of Algorithm 1. 
We can relax this condition and assume that each sample $\mathbf{x}_j$ has some graded or 
“fuzzy” cluster membership $\mu_i(\mathbf{x}_j)$ in cluster $\omega_i$, where $0 \le \mu_i(\mathbf{x}_j) \le 1$. At root, 
these “memberships” are equivalent to the probabilities $\hat{P}(\omega_i|\mathbf{x}_j, \hat{\boldsymbol{\theta}})$ given in Eq. 22, 
and thus we use this symbol. In the resulting fuzzy k-means clustering algorithm we 
seek a minimum of a global cost function 

$$L = \sum_{i=1}^{c} \sum_{j=1}^{n} [\hat{P}(\omega_i|\mathbf{x}_j, \hat{\boldsymbol{\theta}})]^b \, \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2, \tag{24}$$


where b > 1 is a free parameter chosen to adjust the “blending” of different clusters. 

Figure 10.2: Trajectories for the means of the k-means clustering procedure applied to 
two-dimensional data. The final Voronoi tesselation (for classification) is also shown; 
the means correspond to the “centers” of the Voronoi cells. 


If b is set to 0, this criterion function is merely a sum-of-squared errors criterion we 
shall see again in Eq. 49. The probabilities of cluster membership for each point are 
normalized as 


$$\sum_{i=1}^{c} \hat{P}(\omega_i|\mathbf{x}_j) = 1, \qquad j = 1, \ldots, n. \tag{25}$$

At the solution, i.e., the minimum of L, we have 


$$\partial L / \partial \boldsymbol{\mu}_i = 0 \quad \text{and} \quad \partial L / \partial \hat{P}_j = 0, \tag{26}$$


and these lead (Problem 14) to the conditions 


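$$\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} [\hat{P}(\omega_i|\mathbf{x}_j)]^b \, \mathbf{x}_j}{\sum_{j=1}^{n} [\hat{P}(\omega_i|\mathbf{x}_j)]^b} \tag{27}$$

and 

$$\hat{P}(\omega_i|\mathbf{x}_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2. \tag{28}$$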


In general, the criterion is minimized when the cluster centers $\boldsymbol{\mu}_j$ are near those 
points that have high estimated probability of being in cluster $j$. Since Eqs. 27 & 28 
rarely have analytic solutions, the cluster means and point probabilities are estimated 
iteratively according to the following algorithm: 


Algorithm 2 (Fuzzy k-means clustering) 

1 begin initialize $n$, $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_c$, $\hat{P}(\omega_i | \mathbf{x}_j)$, $i = 1, \ldots, c$; $j = 1, \ldots, n$ 
2     normalize probabilities of cluster memberships by Eq. 25 
3     do classify $n$ samples according to nearest $\boldsymbol{\mu}_i$ 
4         recompute $\boldsymbol{\mu}_i$ by Eq. 27 
5         recompute $\hat{P}(\omega_i | \mathbf{x}_j)$ by Eq. 28 
6     until no change in $\boldsymbol{\mu}_i$ and $\hat{P}(\omega_i | \mathbf{x}_j)$ 
7     return $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$ 
8 end 

Figure 10.3: At each iteration of the fuzzy k-means clustering algorithm, the 
probabilities of category membership for each point are adjusted according to Eqs. 27 & 
28 (here $b = 2$). While most points have non-negligible memberships in two or three 
clusters, we nevertheless draw the boundary of a Voronoi tesselation to illustrate the 
progress of the algorithm. After four iterations, the algorithm has converged and the 
red cluster centers and associated Voronoi tesselation would be used for assigning new 
points to clusters. 
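
The alternation between Eq. 27 and Eq. 28 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the book's implementation: the function name `fuzzy_k_means` and the parameters `n_iter`, `tol` and `seed` are hypothetical, the memberships (rather than the means) are initialized at random, the hard classification step is omitted because Eqs. 27 & 28 do not use it, and the loop stops when the means move by less than `tol` rather than at "no change". It requires $b > 1$.

```python
import numpy as np

def fuzzy_k_means(X, c, b=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate Eq. 27 and Eq. 28 until the means stop moving (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, normalized over clusters as in Eq. 25.
    P = rng.random((c, n))
    P /= P.sum(axis=0, keepdims=True)
    mu = np.zeros((c, X.shape[1]))
    for _ in range(n_iter):
        W = P ** b                                        # [P(w_i | x_j)]^b
        new_mu = (W @ X) / W.sum(axis=1, keepdims=True)   # Eq. 27
        # d_ij = ||x_j - mu_i||^2, floored to avoid division by zero.
        d = ((X[None, :, :] - new_mu[:, None, :]) ** 2).sum(axis=2)
        d = np.maximum(d, 1e-12)
        inv = (1.0 / d) ** (1.0 / (b - 1.0))
        P = inv / inv.sum(axis=0, keepdims=True)          # Eq. 28
        shift = np.abs(new_mu - mu).max()
        mu = new_mu
        if shift < tol:
            break
    return mu, P

# Usage on made-up one-dimensional data with two well-separated groups.
X = np.array([[0.0], [0.5], [1.0], [9.0], [9.5], [10.0]])
mu, P = fuzzy_k_means(X, c=2, b=2.0)
print(np.sort(mu[:, 0]))      # estimated cluster centers
print(P.sum(axis=0))          # each column of memberships sums to one
```

Because the normalization in Eq. 28 runs over all $c$ clusters, every membership value depends on the chosen number of clusters, which is the sensitivity discussed in the text.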


Figure 10.3 illustrates the algorithm. At early iterations the means lie near the center 
of the full data set because each point has a non-negligible “membership” (i.e., prob- 
ability) in each cluster. At later iterations the means separate and each membership 
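tends toward the value 1.0 or 0.0. Clearly, the classical k-means algorithm is just a 
special case where the memberships for all points obey 

$$\hat{P}(\omega_i|\mathbf{x}_j) = \begin{cases} 1 & \text{if } \|\mathbf{x}_j - \boldsymbol{\mu}_i\| < \|\mathbf{x}_j - \boldsymbol{\mu}_{i'}\| \text{ for all } i' \neq i \\ 0 & \text{otherwise,} \end{cases} \tag{29}$$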


as given by Eq. 17. 

While it may seem that such graded membership might improve convergence of 
k-means over its classical counterpart, in practice there are several drawbacks to the 
fuzzy method. One is that according to Eq. 25 the probability of “membership” of 
a point $\mathbf{x}_j$ in a cluster $i$ depends implicitly on the number of clusters, and when 
the number of clusters is specified incorrectly, serious problems may arise (Computer 
exercise 4). 


10.5 Unsupervised Bayesian Learning 


10.5.1 The Bayes Classifier 


As we saw in Chap. ??, maximum-likelihood methods do not assume the parameter 
vector $\boldsymbol{\theta}$ to be random; it is just unknown. In such methods, prior knowledge about 
the likely values for $\boldsymbol{\theta}$ is not directly relevant, although in practice such knowledge 
may be used in choosing good starting points for hill-climbing procedures. In this 
section, however, we shall take a Bayesian approach to unsupervised learning. That 
is, we shall assume that $\boldsymbol{\theta}$ is a random variable with a known prior distribution $p(\boldsymbol{\theta})$, 
and we shall use the samples to compute the posterior density $p(\boldsymbol{\theta}|D)$. Interestingly 
enough, the analysis will closely parallel the analysis of supervised Bayesian learning 
(Sect. ??.??), showing that the two problems are formally very similar. 

We begin with an explicit statement of our basic assumptions. We assume that 


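1. The number of classes $c$ is known. 

2. The prior probabilities $P(\omega_j)$ for each class are known, $j = 1, \ldots, c$. 

3. The forms for the class-conditional probability densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are known, 
$j = 1, \ldots, c$, but the full parameter vector $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)$ is not known. 

4. Part of our knowledge about $\boldsymbol{\theta}$ is contained in a known prior density $p(\boldsymbol{\theta})$. 

5. The rest of our knowledge about $\boldsymbol{\theta}$ is contained in a set $D$ of $n$ samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ 
drawn independently from the familiar mixture density 

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j) P(\omega_j). \tag{30}$$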


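At this point we could go directly to the calculation of $p(\boldsymbol{\theta}|D)$. However, let us 
first see how this density is used to determine the Bayes classifier. Suppose that a 
state of nature is selected with probability $P(\omega_i)$ and a feature vector $\mathbf{x}$ is selected 
according to the probability law $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)$. To derive the Bayes classifier we must use 
all of the information at our disposal to compute the posterior probability $P(\omega_i|\mathbf{x})$. 
We exhibit the role of the samples explicitly by writing this as $P(\omega_i|\mathbf{x}, D)$. By Bayes’ 
formula, we have 

$$P(\omega_i|\mathbf{x}, D) = \frac{p(\mathbf{x}|\omega_i, D) P(\omega_i|D)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, D) P(\omega_j|D)}. \tag{31}$$

Since the selection of the state of nature $\omega_i$ was done independently of the previously 
drawn samples, $P(\omega_i|D) = P(\omega_i)$, and we obtain 

$$P(\omega_i|\mathbf{x}, D) = \frac{p(\mathbf{x}|\omega_i, D) P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, D) P(\omega_j)}. \tag{32}$$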


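Central to the Bayesian approach is the introduction of the unknown parameter 
vector $\boldsymbol{\theta}$ via 

$$p(\mathbf{x}|\omega_i, D) = \int p(\mathbf{x}, \boldsymbol{\theta}|\omega_i, D) \, d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta}, \omega_i, D) \, p(\boldsymbol{\theta}|\omega_i, D) \, d\boldsymbol{\theta}. \tag{33}$$

Since the selection of $\mathbf{x}$ is independent of the samples, we have $p(\mathbf{x}|\boldsymbol{\theta}, \omega_i, D) = 
p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)$. Similarly, since knowledge of the state of nature when $\mathbf{x}$ is selected tells 
us nothing about the distribution of $\boldsymbol{\theta}$, we have $p(\boldsymbol{\theta}|\omega_i, D) = p(\boldsymbol{\theta}|D)$, and thus 

$$p(\mathbf{x}|\omega_i, D) = \int p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) \, p(\boldsymbol{\theta}|D) \, d\boldsymbol{\theta}. \tag{34}$$

That is, our best estimate of $p(\mathbf{x}|\omega_i)$ is obtained by averaging $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)$ over $\boldsymbol{\theta}_i$. 
Whether or not this is a good estimate depends on the nature of $p(\boldsymbol{\theta}|D)$, and thus 
our attention turns at last to that density. 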

10.5.2 Learning the Parameter Vector 


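We can use Bayes’ formula to write 

$$p(\boldsymbol{\theta}|D) = \frac{p(D|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int p(D|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}, \tag{35}$$

where the independence of the samples yields the likelihood 

$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta}). \tag{36}$$

Alternatively, letting $D^n$ denote the set of $n$ samples, we can write Eq. 35 in the 
recursive form 

$$p(\boldsymbol{\theta}|D^n) = \frac{p(\mathbf{x}_n|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}|D^{n-1})}{\int p(\mathbf{x}_n|\boldsymbol{\theta}) \, p(\boldsymbol{\theta}|D^{n-1}) \, d\boldsymbol{\theta}}. \tag{37}$$

These are the basic equations for unsupervised Bayesian learning. Equation 35 
emphasizes the relation between the Bayesian and the maximum-likelihood solutions. 
If $p(\boldsymbol{\theta})$ is essentially uniform over the region where $p(D|\boldsymbol{\theta})$ peaks, then $p(\boldsymbol{\theta}|D)$ peaks 
at the same place. If the only significant peak occurs at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, and if the peak is very 
sharp, then Eqs. 32 & 34 yield 

$$p(\mathbf{x}|\omega_i, D) \approx p(\mathbf{x}|\omega_i, \hat{\boldsymbol{\theta}}_i) \tag{38}$$

and 

$$P(\omega_i|\mathbf{x}, D) \approx \frac{p(\mathbf{x}|\omega_i, \hat{\boldsymbol{\theta}}_i) P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \hat{\boldsymbol{\theta}}_j) P(\omega_j)}. \tag{39}$$

That is, these conditions justify the use of the maximum-likelihood estimate as if it 
were the true value of $\boldsymbol{\theta}$ in designing the Bayes classifier. 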

As we saw in Sect. ??.??, in the limit of large amounts of data, maximum-likelihood 
and the Bayes methods will agree (or nearly agree). While for many small-sample-size 
problems they will agree, there exist small problems where the approximations are 
poor (Fig. 10.4). As we saw in the analogous case in supervised learning, whether one 
chooses to use the maximum-likelihood or the Bayes method depends not only on how 
confident one is of the prior distributions, but also on computational considerations; 
maximum-likelihood techniques are often easier to implement than Bayesian ones. 

Figure 10.4: In a highly skewed or multiple-peak posterior distribution such as 
illustrated here, the maximum-likelihood solution $\hat{\boldsymbol{\theta}}$ will yield a density very different from 
a Bayesian solution, which requires the integration over the full range of parameter 
space $\boldsymbol{\theta}$. 

Of course, if $p(\boldsymbol{\theta})$ has been obtained by supervised learning using a large set of 
labeled samples, it will be far from uniform, and it will have a dominant influence on 
$p(\boldsymbol{\theta}|D^n)$ when $n$ is small. Equation 37 shows how the observation of an additional 
unlabeled sample modifies our opinion about the true value of $\boldsymbol{\theta}$, and emphasizes the 
ideas of updating and learning. If the mixture density $p(\mathbf{x}|\boldsymbol{\theta})$ is identifiable, then 
each additional sample tends to sharpen $p(\boldsymbol{\theta}|D^n)$, and under fairly general conditions 
$p(\boldsymbol{\theta}|D^n)$ can be shown to converge (in probability) to a Dirac delta function centered 
at the true value of $\boldsymbol{\theta}$ (Problem 8). Thus, even though we do not know the categories 
of the samples, identifiability assures us that we can learn the unknown parameter 
vector $\boldsymbol{\theta}$, and thereby learn the component densities $p(\mathbf{x}|\omega_i, \boldsymbol{\theta})$. 

This, then, is the formal Bayesian solution to the problem of unsupervised learning. 
In retrospect, the fact that unsupervised learning of the parameters of a mixture 
density is so similar to supervised learning of the parameters of a component density 
is not at all surprising. Indeed, if the component density is itself a mixture, there 
would appear to be no essential difference between the two problems. 

There are, however, some significant differences between supervised and unsuper- 
vised learning. One of the major differences concerns the issue of identifiability. With 
supervised learning, the lack of identifiability merely means that instead of obtaining 
a unique parameter vector we obtain an equivalence class of parameter vectors. Be- 
cause all of these yield the same component density, lack of identifiability presents no 
theoretical difficulty. A lack of identifiability is much more serious in unsupervised 
learning. When $\boldsymbol{\theta}$ cannot be determined uniquely, the mixture cannot be decomposed 
into its true components. Thus, while $p(\mathbf{x}|D^n)$ may still converge to $p(\mathbf{x})$, $p(\mathbf{x}|\omega_i, D^n)$ 
given by Eq. 34 will not in general converge to $p(\mathbf{x}|\omega_i)$, and a theoretical barrier to 
learning exists. It is here that a few labeled training samples would be valuable: for 
“decomposing” the mixture into its components. 

Another serious problem for unsupervised learning is computational complexity. 
With supervised learning, the possibility of finding sufficient statistics allows solutions 
that are analytically pleasing and computationally feasible. With unsupervised 
learning, there is no way to avoid the fact that the samples are obtained from a mixture 
density, 

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j) P(\omega_j), \tag{1}$$

and this gives us little hope of ever finding simple exact solutions for $p(\boldsymbol{\theta}|D)$. Such 
solutions are tied to the existence of a simple sufficient statistic (Sect. ??.??), and the 
factorization theorem requires the ability to factor $p(D|\boldsymbol{\theta})$ as 

$$p(D|\boldsymbol{\theta}) = g(\mathbf{s}, \boldsymbol{\theta}) \, h(D). \tag{40}$$

But from Eqs. 36 & 1, we see that the likelihood can be written as 

$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} \left[ \sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \boldsymbol{\theta}_j) P(\omega_j) \right]. \tag{41}$$


Thus, $p(D|\boldsymbol{\theta})$ is the sum of $c^n$ products of component densities. Each term in this 
sum can be interpreted as the joint probability of obtaining the samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$ 
bearing a particular labeling, with the sum extending over all of the ways that the 
samples could be labeled. Clearly, this results in a thorough mixture of $\boldsymbol{\theta}$ and the $\mathbf{x}$’s, 
and no simple factoring should be expected. An exception to this statement arises 
if the component densities do not overlap, so that as $\boldsymbol{\theta}$ varies only one term in the 
mixture density is non-zero. In that case, $p(D|\boldsymbol{\theta})$ is the product of the $n$ nonzero 
terms, and may possess a simple sufficient statistic. However, since that case allows 
the class of any sample to be determined, it actually reduces the problem to one of 
supervised learning, and thus is not a significant exception. 
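
The claim that Eq. 41 expands into $c^n$ terms, one per labeling of the samples, can be checked numerically for a small case. The following Python sketch is illustrative only; it assumes one-dimensional unit-variance Gaussian components, and the function names and the sample values are hypothetical.

```python
import itertools
import math

def gaussian(x, mean):
    """Unit-variance normal density (an assumption of this sketch)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood_product(xs, means, priors):
    """Eq. 41: product over samples of the mixture density."""
    total = 1.0
    for x in xs:
        total *= sum(P * gaussian(x, m) for m, P in zip(means, priors))
    return total

def likelihood_sum_over_labelings(xs, means, priors):
    """The same quantity written as a sum with one term per labeling."""
    c = len(means)
    total = 0.0
    n_terms = 0
    for labels in itertools.product(range(c), repeat=len(xs)):
        term = 1.0
        for x, j in zip(xs, labels):
            term *= priors[j] * gaussian(x, means[j])
        total += term
        n_terms += 1
    return total, n_terms

# Made-up samples; c = 2 components and n = 4 samples give 2**4 labelings.
xs = [-1.0, 0.3, 2.2, 1.7]
means = [-2.0, 2.0]
priors = [1.0 / 3.0, 2.0 / 3.0]
print(likelihood_product(xs, means, priors))
print(likelihood_sum_over_labelings(xs, means, priors))
```

The explicit enumeration is practical only for very small $n$, which is exactly the computational difficulty the text describes.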

Another way to compare supervised and unsupervised learning is to substitute the 
mixture density for $p(\mathbf{x}_n|\boldsymbol{\theta})$ in Eq. 37 and obtain 

$$p(\boldsymbol{\theta}|D^n) = \frac{\sum_{j=1}^{c} p(\mathbf{x}_n|\omega_j, \boldsymbol{\theta}_j) P(\omega_j)}{\sum_{j=1}^{c} \int p(\mathbf{x}_n|\omega_j, \boldsymbol{\theta}_j) P(\omega_j) \, p(\boldsymbol{\theta}|D^{n-1}) \, d\boldsymbol{\theta}} \; p(\boldsymbol{\theta}|D^{n-1}). \tag{42}$$

If we consider the special case where $P(\omega_1) = 1$ and all the other prior probabilities 
are zero, corresponding to the supervised case in which all samples come from Class 
$\omega_1$, then Eq. 42 simplifies to 

$$p(\boldsymbol{\theta}|D^n) = \frac{p(\mathbf{x}_n|\omega_1, \boldsymbol{\theta}_1)}{\int p(\mathbf{x}_n|\omega_1, \boldsymbol{\theta}_1) \, p(\boldsymbol{\theta}|D^{n-1}) \, d\boldsymbol{\theta}} \; p(\boldsymbol{\theta}|D^{n-1}). \tag{43}$$

Let us compare Eqs. 42 & 43 to see how observing an additional sample changes 
our estimate of $\boldsymbol{\theta}$. In each case we can ignore the normalizing denominator, which is 
independent of $\boldsymbol{\theta}$. Thus, the only significant difference is that in the supervised case 
we multiply the “prior” density for $\boldsymbol{\theta}$ by the component density $p(\mathbf{x}_n|\omega_1, \boldsymbol{\theta}_1)$, while 
in the unsupervised case we multiply it by the mixture density $\sum_{j=1}^{c} p(\mathbf{x}_n|\omega_j, \boldsymbol{\theta}_j) P(\omega_j)$. 
Assuming that the sample really did come from Class $\omega_1$, we see that the effect of 
not knowing this category membership in the unsupervised case is to diminish the 
influence of $\mathbf{x}_n$ on changing $\boldsymbol{\theta}$. Since $\mathbf{x}_n$ could have come from any of the $c$ classes, we 
cannot use it with full effectiveness in changing the component(s) of $\boldsymbol{\theta}$ associated with 
any one category. Rather, we must distribute its effect over the various categories 
in accordance with the probability that it arose from each category. 


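Example 2: Unsupervised learning of Gaussian data 


As an example, consider the one-dimensional, two-component mixture with $p(x|\omega_1) \sim 
N(\mu, 1)$, $p(x|\omega_2, \theta) \sim N(\theta, 1)$, where $\mu$, $P(\omega_1)$ and $P(\omega_2)$ are known. Here we have 

$$p(x|\theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \mu)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \theta)^2\right],$$

and we seek the mean of the second component. 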

Viewed as a function of $x$, this mixture density is a superposition of two normal 
densities, one peaking at $x = \mu$ and the other peaking at $x = \theta$. Viewed as a 
function of $\theta$, $p(x|\theta)$ has a single peak at $\theta = x$. Suppose that the prior density $p(\theta)$ 
is uniform from $a$ to $b$. Then after one observation ($x = x_1$) we have 

$$p(\theta|x_1) = \alpha \, p(x_1|\theta) \, p(\theta) = \begin{cases} \alpha' \left\{ P(\omega_1) \exp\left[-\tfrac{1}{2}(x_1 - \mu)^2\right] + P(\omega_2) \exp\left[-\tfrac{1}{2}(x_1 - \theta)^2\right] \right\} & a \le \theta \le b \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha$ and $\alpha'$ are normalizing constants that are independent of $\theta$. If the sample 
$x_1$ is in the range $a \le x_1 \le b$, then $p(\theta|x_1)$ peaks at $\theta = x_1$, of course. Otherwise it 
peaks either at $\theta = a$ if $x_1 < a$ or at $\theta = b$ if $x_1 > b$. Note that the additive constant 
$\exp[-(1/2)(x_1 - \mu)^2]$ is large if $x_1$ is near $\mu$, and thus the peak of $p(\theta|x_1)$ is less 
pronounced if $x_1$ is near $\mu$. This corresponds to the fact that if $x_1$ is near $\mu$, it is 
more likely to have come from the $p(x|\omega_1)$ component, and hence its influence on our 
estimate for $\theta$ is diminished. 

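With the addition of a second sample $x_2$, $p(\theta|x_1)$ changes to 

$$p(\theta|x_1, x_2) = \beta \, p(x_2|\theta) \, p(\theta|x_1) = \begin{cases} \beta' \Big\{ P(\omega_1) P(\omega_1) \exp\left[-\tfrac{1}{2}(x_1 - \mu)^2 - \tfrac{1}{2}(x_2 - \mu)^2\right] \\ \quad + P(\omega_1) P(\omega_2) \exp\left[-\tfrac{1}{2}(x_1 - \mu)^2 - \tfrac{1}{2}(x_2 - \theta)^2\right] \\ \quad + P(\omega_2) P(\omega_1) \exp\left[-\tfrac{1}{2}(x_1 - \theta)^2 - \tfrac{1}{2}(x_2 - \mu)^2\right] \\ \quad + P(\omega_2) P(\omega_2) \exp\left[-\tfrac{1}{2}(x_1 - \theta)^2 - \tfrac{1}{2}(x_2 - \theta)^2\right] \Big\} & a \le \theta \le b \\ 0 & \text{otherwise.} \end{cases}$$

Unfortunately, the primary thing we learn from this expression is that $p(\theta|D^n)$ is 
already complicated when $n = 2$. The four terms in the sum correspond to the 
four ways in which the samples could have been drawn from the two component 
populations. With $n$ samples there will be $2^n$ terms, and no simple sufficient statistics 
can be found to facilitate understanding or to simplify computations. 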

It is possible to use the relation 

$$p(\theta|D^n) = \frac{p(x_n|\theta) \, p(\theta|D^{n-1})}{\int p(x_n|\theta) \, p(\theta|D^{n-1}) \, d\theta}$$

and numerical integration to obtain an approximate numerical solution for $p(\theta|D^n)$. 
This was done for the data in Example 1 using the values $\mu = 2$, $P(\omega_1) = 1/3$, and 
$P(\omega_2) = 2/3$. A prior density $p(\theta)$ uniform from $-4$ to $+4$ encompasses the data in 
the table. When this was used to start the recursive computation of $p(\theta|D^n)$, the 
results were as shown in the figure. As $n$ goes to infinity we can confidently expect $p(\theta|D^n)$ 
to approach an impulse centered at $\theta = 2$. This graph gives some idea of the rate of 
convergence. 
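
The recursive relation lends itself to a simple grid-based computation. The sketch below is a hedged illustration, not the computation used for the figure: the posterior is tabulated on a uniform grid of $\theta$ values, the integral is approximated by a Riemann sum, and the samples are synthetic (the data table of Example 1 is not reproduced here). The function names, the grid resolution, the random seed and the mean of the known component ($-2$) are all assumptions made for the demonstration.

```python
import numpy as np

def normal_pdf(x, mean):
    """Unit-variance normal density."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def recursive_posterior(samples, theta_grid, prior, mu, P1, P2):
    """Apply the recursive update to a posterior tabulated on theta_grid."""
    dtheta = theta_grid[1] - theta_grid[0]
    post = prior / (prior.sum() * dtheta)      # normalize the prior on the grid
    for x in samples:
        # Mixture likelihood p(x | theta) for every theta on the grid.
        like = P1 * normal_pdf(x, mu) + P2 * normal_pdf(x, theta_grid)
        post = like * post
        post /= post.sum() * dtheta            # Riemann-sum normalization
    return post

# Synthetic data: known component at mu = -2, unknown component at theta = 2.
rng = np.random.default_rng(0)
n = 200
from_first = rng.random(n) < 1.0 / 3.0
samples = np.where(from_first, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

theta_grid = np.linspace(-4.0, 4.0, 801)
prior = np.ones_like(theta_grid)               # uniform prior on [-4, 4]
post = recursive_posterior(samples, theta_grid, prior, mu=-2.0, P1=1/3, P2=2/3)
print(theta_grid[np.argmax(post)])             # location of the posterior peak
```

Renormalizing after every sample keeps the tabulated values in a safe numerical range and mirrors the denominator of the recursive relation.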


[Figure: $p(\theta|x_1, \ldots, x_n)$ plotted against $\theta$ for $-4 \le \theta \le 4$, with curves labeled by the number of samples $n$.] 

In unsupervised Bayesian learning of the parameter $\theta$, the density becomes more 
peaked as the number of samples increases. The top figure uses a wide uniform prior 
$p(\theta) = 1/8$, $-4 \le \theta \le 4$, while the bottom figure uses a narrower one, $p(\theta) = 1/2$, 
$1 \le \theta \le 3$. Despite these different prior distributions, after all 25 samples have been used, 
the posterior densities are virtually identical in the two cases: the information in 
the samples overwhelms the prior information. 


One of the main differences between the Bayesian and the maximum-likelihood 
approaches to unsupervised learning appears in the presence of the prior density $p(\theta)$. 
The figure shows how $p(\theta|D^n)$ changes when $p(\theta)$ is assumed to be uniform from 1 to 
3, corresponding to more certain initial knowledge about $\theta$. The results of this change 
are most pronounced when $n$ is small. It is here (just as in the classification analog 
of Chap. ??) that the differences between the Bayesian and the maximum-likelihood 
solutions are most significant. As $n$ increases, the importance of prior knowledge 
diminishes, and in the particular case the curves for $n = 25$ are virtually identical. In 
general, one would expect the difference to be small when the number of unlabeled 
samples is several times the effective number of labeled samples used to determine 
$p(\theta)$. 


10.5.3 Decision-Directed Approximation 


Although the problem of unsupervised learning can be stated as merely the problem of 
estimating parameters of a mixture density, neither the maximum-likelihood nor the 
Bayesian approach yields analytically simple results. Exact solutions for even the sim- 
plest nontrivial examples lead to computational requirements that grow exponentially 
with the number of samples (Problem ??). The problem of unsupervised learning is 
too important to abandon just because exact solutions are hard to find, however, and 
numerous procedures for obtaining approximate solutions have been suggested. 


Since the important difference between supervised and unsupervised learning is 
the presence or absence of labels for the samples, an obvious approach to unsuper- 
vised learning is to use the prior information to design a classifier and to use the 
decisions of this classifier to label the samples. This is called the decision-directed 
approach to unsupervised learning, and it is subject to many variations. It can be 
applied sequentially on-line by updating the classifier each time an unlabeled sample 
is classified. Alternatively, it can be applied in parallel (batch mode) by waiting un- 
til all n samples are classified before updating the classifier. If desired, this process 
can be repeated until no changes occur in the way the samples are labeled. Various 
heuristics can be introduced to make the extent of any corrections depend upon the 
confidence of the classification decision. 
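
A minimal batch-mode version of this idea can be sketched as follows. It is an illustration under simplifying assumptions that are not part of the text: one-dimensional data, unit-variance Gaussian components with known priors, and only the component means being learned. The function name `decision_directed_means` and its arguments are hypothetical.

```python
import numpy as np

def decision_directed_means(x, mu_init, priors, max_iter=100):
    """Batch decision-directed estimate of component means (unit variances assumed)."""
    mu = np.asarray(mu_init, dtype=float).copy()
    log_priors = np.log(np.asarray(priors, dtype=float))
    labels = None
    for _ in range(max_iter):
        # Classify every sample with the current classifier (maximum posterior).
        scores = -0.5 * (x[:, None] - mu[None, :]) ** 2 + log_priors[None, :]
        new_labels = scores.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # labeling unchanged, so stop
        labels = new_labels
        # Treat the decisions as if they were true labels and re-estimate each mean.
        for j in range(mu.size):
            if np.any(labels == j):
                mu[j] = x[labels == j].mean()
    return mu, labels

# Made-up data and a rough initial classifier.
x = np.array([-2.5, -2.0, -1.5, 1.5, 2.0, 2.5, 2.2])
mu, labels = decision_directed_means(x, mu_init=[-1.0, 1.0], priors=[0.5, 0.5])
print(mu, labels)
```

With equal priors and equal variances this batch procedure reduces to assigning each sample to the nearest mean, so it behaves like the k-means iteration; the sequential variant would instead update the estimate after each individual decision.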


There are some obvious dangers associated with the decision-directed approach. 
If the initial classifier is not reasonably good, or if an unfortunate sequence of samples 
is encountered, the errors in classifying the unlabeled samples can drive the classifier 
the wrong way, resulting in a solution corresponding roughly to one of the lesser 
peaks of the likelihood function. Even if the initial classifier is optimal, in general 
the resulting labeling will not be the same as the true class membership; the act of 
classification will exclude samples from the tails of the desired distribution, and will 
include samples from the tails of the other distributions. Thus, if there is significant 
overlap between the component densities, one can expect biased estimates and less 
than optimal results. 


Despite these drawbacks, the simplicity of decision-directed procedures makes the 
Bayesian approach computationally feasible, and a flawed solution is often better than 
none. If conditions are favorable, performance that is nearly optimal can be achieved 
at far less computational expense. In practice it is found that most of these procedures 
work well if the parametric assumptions are valid, if there is little overlap between 
the component densities, and if the initial classifier design is at least roughly correct 
(Computer exercise 7). 


10.6 *Data Description and Clustering 


Let us reconsider our original problem of learning something of use from a set of 
unlabeled samples. Viewed geometrically, these samples may form clouds of points in 
a d-dimensional space. Suppose that we knew that these points came from a single 
normal distribution. Then the most we could learn from the data would be contained 
in the sufficient statistics — the sample mean and the sample covariance matrix. 
In essence, these statistics constitute a compact description of the data. The sample 
mean locates the center of gravity of the cloud; it can be thought of as the single point 
m that best represents all of the data in the sense of minimizing the sum of squared 
distances from m to the samples. The sample covariance matrix describes the amount 
the data scatters along various directions. If the data points are actually normally 
distributed, then the cloud has a simple hyperellipsoidal shape, and the sample mean 
tends to fall in the region where the samples are most densely concentrated. 
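
The statement that the sample mean minimizes the sum of squared distances can be verified in one step. The following short derivation is a sketch; the symbol $J(\mathbf{m})$ for the sum of squared distances is introduced here only for illustration.

```latex
J(\mathbf{m}) = \sum_{k=1}^{n} \|\mathbf{x}_k - \mathbf{m}\|^2,
\qquad
\nabla_{\mathbf{m}} J = -2 \sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{m}) = \mathbf{0}
\quad \Longrightarrow \quad
\mathbf{m} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k .
```

Since $J$ is a convex quadratic in $\mathbf{m}$, this stationary point is the unique minimum.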

Of course, if the samples are not normally distributed, these statistics can give 
a very misleading description of the data. Figure 10.5 shows four different data sets 
that all have the same mean and covariance matrix. Obviously, second-order statistics 
are incapable of revealing all of the structure in an arbitrary set of data. 


Figure 10.5: These four data sets have identical statistics up to second-order, i.e., the 
same mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. In such cases it is important to include in the model 
more parameters to represent the structure more completely. 


If we assume that the samples come from a mixture of c normal distributions, 
we can approximate a greater variety of situations. In essence, this corresponds to 
assuming that the samples fall in hyperellipsoidally shaped clouds of various sizes 
and orientations. If the number of component densities is sufficiently high, we can 
approximate virtually any density function as a mixture model in this way, and use the 
parameters of the mixture to describe the data. Alas, we have seen that the problem 
of estimating the parameters of a mixture density is not trivial. Furthermore, in 
situations where we have relatively little prior knowledge about the nature of the 
data, the assumption of particular parametric forms may lead to poor or meaningless 
results. Instead of finding structure in the data, we would be imposing structure on 
it. 

One alternative is to use one of the nonparametric methods described in Chap. ?? 
to estimate the unknown mixture density. If accurate, the resulting estimate is cer- 
tainly a complete description of what we can learn from the data. Regions of high 
local density, which might correspond to significant subclasses in the population, can 
be found from the peaks or modes of the estimated density. 

If the goal is to find subclasses, a more direct alternative is to use a clustering 
procedure. Roughly speaking, clustering procedures yield a data description in terms 
of clusters or groups of data points that possess strong internal similarities. Formal 
clustering procedures use a criterion function, such as the sum of the squared dis- 
tances from the cluster centers, and seek the grouping that extremizes the criterion 
function. Because even this can lead to unmanageable computational problems, other 
procedures have been proposed that are intuitively appealing but that lead to solu- 
tions having few if any established properties. Their use is usually justified on the 
ground that they are easy to apply and often yield interesting results that may guide 
the application of more rigorous procedures. 


10.6.1 Similarity Measures 


Once we describe the clustering problem as one of finding natural groupings in a set of 
data, we are obliged to define what we mean by a natural grouping. In what sense are 
we to say that the samples in one cluster are more like one another than like samples 
in other clusters? This question actually involves two separate issues: 


• How should one measure the similarity between samples? 

• How should one evaluate a partitioning of a set of samples into clusters? 


In this section we address the first of these issues. 

The most obvious measure of the similarity (or dissimilarity) between two samples 
is the distance between them. One way to begin a clustering investigation is to define 
a suitable distance function and compute the matrix of distances between all pairs 
of samples. If distance is a good measure of dissimilarity, then one would expect the 
distance between samples in the same cluster to be significantly less than the distance 
between samples in different clusters. 

Suppose for the moment that we say that two samples belong to the same cluster 
if the Euclidean distance between them is less than some threshold distance $d_0$. It is 
immediately obvious that the choice of $d_0$ is very important. If $d_0$ is very large, all 
of the samples will be assigned to one cluster. If $d_0$ is very small, each sample will 
form an isolated, singleton cluster. To obtain “natural” clusters, $d_0$ will have to be 
greater than the typical within-cluster distances and less than typical between-cluster 
distances (Fig. 10.6). 
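
The effect of the threshold can be explored with a small sketch. The code below links any two samples closer than $d_0$ and reports the connected groups, which is one reasonable reading of the rule above (membership is taken to be transitive; that choice, the function name `threshold_clusters` and the sample points are assumptions of this illustration).

```python
import numpy as np

def threshold_clusters(X, d0):
    """Label samples so that any two closer than d0 share a cluster (transitively)."""
    n = len(X)
    # Full matrix of pairwise Euclidean distances.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill the connected group that contains `start`.
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in np.nonzero((D[i] < d0) & (labels == -1))[0]:
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Made-up points: two tight pairs separated by a gap.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
for d0 in (0.5, 2.0, 10.0):
    print(d0, threshold_clusters(X, d0))
```

Running the loop with a small, an intermediate and a large threshold shows the three regimes described in the text: singleton clusters, the two natural groups, and a single cluster.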

Less obvious perhaps is the fact that the results of clustering depend on the choice 
of Euclidean distance as a measure of dissimilarity. That particular choice is generally 
justified if the feature space is isotropic and the data is spread roughly evenly along 
all directions. Clusters defined by Euclidean distance will be invariant to translations 
or rotations in feature space — rigid-body motions of the data points. However, they 
will not be invariant to linear transformations in general, or to other transformations 
that distort the distance relationships. Thus, as Fig. 10.7 illustrates, a simple scaling 
of the coordinate axes can result in a different grouping of the data into clusters. Of 
course, this is of no concern for problems in which arbitrary rescaling is an unnatural 
or meaningless transformation. However, if clusters are to mean anything, they should 
be invariant to transformations natural to the problem. 

Figure 10.6: The distance threshold affects the number and size of clusters. Lines are 
drawn between points closer than a distance $d_0$ apart for three different values of $d_0$; 
the smaller the value of $d_0$, the smaller and more numerous the clusters. 

One way to achieve invariance is to normalize the data prior to clustering. For 
example, to obtain invariance to displacement and scale changes, one might translate 
and scale the axes so that all of the features have zero mean and unit variance — 
standardize the data. To obtain invariance to rotation, one might rotate the axes so 
that they coincide with the eigenvectors of the sample covariance matrix. This trans- 
formation to principal components (Sect. 10.13.1) can be preceded and/or followed by 
normalization for scale. 
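
Both normalizations are easy to state in code. The following NumPy sketch is illustrative only; the function names are hypothetical, and `standardize` assumes that no feature is constant (otherwise the division by the standard deviation fails).

```python
import numpy as np

def standardize(X):
    """Translate and scale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def rotate_to_principal_axes(X):
    """Rotate centered data onto the eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # largest variance first
    return Xc @ eigvecs[:, order]

# Made-up correlated data with a nonzero mean.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ A + 5.0

Z = standardize(X)
Y = rotate_to_principal_axes(X)
print(Z.mean(axis=0), Z.std(axis=0))
print(np.cov(Y, rowvar=False).round(6))        # off-diagonal entries vanish
```

After the rotation the sample covariance is diagonal, so a subsequent per-axis scaling acts independently on each principal direction.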

However, we should not conclude that this kind of normalization is necessarily 
desirable. Consider, for example, the matter of translating and whitening — scaling 
the axes so that each feature has zero mean and unit variance. The rationale usually 
given for this normalization is that it prevents certain features from dominating dis- 
tance calculations merely because they have large numerical values, much as we saw 
in networks trained with backpropagation (Sect. ??.??). Subtracting the mean and 
dividing by the standard deviation is an appropriate normalization if this spread of 
values is due to normal random variation; however, it can be quite inappropriate if the 
spread is due to the presence of subclasses (Fig. ??). Thus, this routine normalization 
may be less than helpful in the cases of greatest interest.* Section ?? describes other 
ways to obtain invariance to scaling. 

Instead of scaling axes, we can change the metric in interesting ways. For instance, 
one broad class of distance metrics is of the form 


$$d(\mathbf{x}, \mathbf{x}') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}, \tag{44}$$


where $q \ge 1$ is a selectable parameter: the general Minkowski metric we considered 
in Chap. ??. Setting $q = 2$ gives the familiar Euclidean metric, while setting $q = 1$ gives 
the Manhattan or city block metric, the sum of the absolute distances along each 
of the $d$ coordinate axes. Note that only $q = 2$ is invariant to an arbitrary rotation or 
translation in feature space. Another alternative is to use some kind of metric based 
on the data itself, such as the Mahalanobis distance. 


* In backpropagation, one of the goals for such preprocessing and scaling of data was to increase 
learning speed; in contrast, such preprocessing does not significantly affect the speed of these 
clustering algorithms. 


Figure 10.7: Scaling axes affects the clusters in a minimum distance cluster method. 
The original data and minimum-distance clusters are shown in the upper left; points 
in one cluster are shown in red, the other gray. When the vertical axis is expanded 
by a factor of 2.0 and the horizontal axis shrunk by a factor of 0.5, the clustering is 
altered (as shown at the right). Alternatively, if the vertical axis is shrunk by a factor 
of 0.5 and the horizontal axis expanded by a factor of 2.0, smaller more numerous 
clusters result (shown at the bottom). In both these scaled cases, the clusters differ 
from the original. 
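
The behavior of the Minkowski metric under rotation can be checked directly. This is a small illustrative sketch; the function name `minkowski` and the example vectors are assumptions, not part of the text.

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of Eq. 44 for a given q >= 1."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float((diff ** q).sum() ** (1.0 / q))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

# Rotate both points by 45 degrees about the origin.
angle = np.pi / 4.0
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])

for q in (1, 2):
    print(q, minkowski(x, y, q), minkowski(R @ x, R @ y, q))
```

For $q = 2$ the two printed distances agree, whereas for $q = 1$ the rotation changes the distance, in line with the remark about rotation invariance.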


More generally, one can abandon the use of distance altogether and introduce a 
nonmetric similarity function $s(\mathbf{x}, \mathbf{x}')$ to compare two vectors $\mathbf{x}$ and $\mathbf{x}'$. Conventionally, 
this is a symmetric function whose value is large when $\mathbf{x}$ and $\mathbf{x}'$ are somehow 
“similar.” For example, when the angle between two vectors is a meaningful measure 
of their similarity, then the normalized inner product 

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t \mathbf{x}'}{\|\mathbf{x}\| \, \|\mathbf{x}'\|} \tag{45}$$

may be an appropriate similarity function. This measure, which is the cosine of the 
angle between $\mathbf{x}$ and $\mathbf{x}'$, is invariant to rotation and dilation, though it is not invariant 
to translation and general linear transformations. 

Figure 10.8: If the data fall into well-separated clusters (left), normalization by a 
whitening transform for the full data may reduce the separation, and hence be 
undesirable (right). Such a whitening normalization may be appropriate if the full data 
set arises from a single fundamental process (with noise), but inappropriate if there 
are several different processes, as shown here. 

When the features are binary valued (0 or 1), this similarity function has a simple
non-geometrical interpretation in terms of shared features or shared attributes. Let
us say that a sample x possesses the ith attribute if x_i = 1. Then x^t x' is merely the
number of attributes possessed by both x and x', and ||x|| ||x'|| = (x^t x x'^t x')^{1/2} is the
geometric mean of the number of attributes possessed by x and the number possessed
by x'. Thus, s(x, x') is a measure of the relative possession of common attributes.
Some simple variations are 


$$ s(x, x') = \frac{x^t x'}{d} \tag{46} $$
the fraction of attributes shared, and 


$$ s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'} \tag{47} $$
the ratio of the number of shared attributes to the number possessed by x or x'. This
latter measure (sometimes known as the Tanimoto coefficient or Tanimoto distance) is 
frequently encountered in the fields of information retrieval and biological taxonomy. 
Related measures of similarity arise in other applications, the variety of measures 
testifying to the diversity of problem domains (Computer exercise ??). 
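
For binary feature vectors the three measures of Eqs. 45-47 reduce to simple counts. The following is a minimal sketch, assuming the vectors are given as Python lists of 0s and 1s; the function names and the two sample vectors are illustrative and are not taken from the text.

```python
import math

def dot(x, y):
    """Inner product x^t y; for binary vectors this counts shared attributes."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Normalized inner product (Eq. 45)."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def fraction_shared(x, y):
    """Shared attributes divided by the number of attributes d (Eq. 46)."""
    return dot(x, y) / len(x)

def tanimoto(x, y):
    """Shared attributes divided by attributes possessed by x or y (Eq. 47)."""
    shared = dot(x, y)
    return shared / (dot(x, x) + dot(y, y) - shared)

# Two hypothetical samples with d = 6 binary attributes; they share two attributes.
x = [1, 1, 0, 1, 0, 0]
y = [1, 0, 0, 1, 1, 0]

print(cosine_similarity(x, y), fraction_shared(x, y), tanimoto(x, y))
```

Here each sample possesses three attributes and two are shared, so the geometric mean in the denominator of Eq. 45 is 3 and the union in the denominator of Eq. 47 contains 4 attributes.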

Fundamental issues in measurement theory are involved in the use of any distance 
or similarity function. The calculation of the similarity between two vectors always 
involves combining the values of their components. Yet in many pattern recognition 
applications the components of the feature vector measure seemingly noncomparable 
quantities, such as meters and kilograms. Recall our example of classifying fish: how 
can one compare the lightness of the skin to the length or weight of the fish? Should 
the comparison depend on whether the length is measured in meters or inches? How 
does one treat vectors whose components have a mixture of nominal, ordinal, interval 
and ratio scales? Ultimately, there are rarely clear methodological answers to these 
questions. When a user selects a particular similarity function or normalizes the data 
in a particular way, information is introduced that gives the procedure meaning. We 
have given examples of some alternatives that have proved to be useful. (Competitive 
learning, discussed in Sect. 10.11, is a popular decision-directed clustering algorithm.)
Beyond that we can do little more than alert the unwary to these pitfalls of clustering. 

Amidst all this discussion of clustering, we must not lose sight of the fact that 
often the clusters found will later be labeled (e.g., by resorting to a teacher or small 
number of labeled samples), and that the clusters can then be used for classification. 
In that case, the same similarity (or metric) should be used for classification as was 
used for forming the clusters (Computer exercise 8). 


10.7 Criterion Functions for Clustering 


We have just considered the first major issue in clustering: how to measure “similar- 
ity.” Now we turn to the second major issue: the criterion function to be optimized. 
Suppose that we have a set D of n samples x_1, ..., x_n that we want to partition
into exactly c disjoint subsets D_1, ..., D_c. Each subset is to represent a cluster, with
samples in the same cluster being somehow more similar than samples in different 
clusters. One way to make this into a well-defined problem is to define a criterion 
function that measures the clustering quality of any partition of the data. Then the 
problem is one of finding the partition that extremizes the criterion function. In this 
section we examine the characteristics of several basically similar criterion functions, 
postponing until later the question of how to find an optimal partition. 


10.7.1 The Sum-of-Squared-Error Criterion 


The simplest and most widely used criterion function for clustering is the
sum-of-squared-error criterion. Let n_i be the number of samples in D_i and let m_i be the
mean of those samples,


$$ m_i = \frac{1}{n_i} \sum_{x \in D_i} x. \tag{48} $$


Then the sum-of-squared-error criterion is defined by


$$ J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2. \tag{49} $$


This criterion function has a simple interpretation: for a given cluster D_i, the
mean vector m_i is the best representative of the samples in D_i in the sense that it
minimizes the sum of the squared lengths of the “error” vectors x - m_i in D_i. Thus,
J_e measures the total squared error incurred in representing the n samples x_1, ..., x_n
by the c cluster centers m_1, ..., m_c. The value of J_e depends on how the samples are
grouped into clusters and the number of clusters; the optimal partitioning is defined
as one that minimizes J_e. Clusterings of this type are often called minimum variance
partitions.
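
The criterion of Eq. 49 can be evaluated directly for any candidate partition. The sketch below assumes a partition is represented as a list of clusters, each a list of coordinate tuples; the helper names and the four sample points are illustrative only.

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples (Eq. 48)."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def sum_squared_error(clusters):
    """J_e of Eq. 49: squared distance of every sample to its own cluster mean."""
    total = 0.0
    for cluster in clusters:
        m = mean(cluster)
        total += sum(sq_dist(x, m) for x in cluster)
    return total

# The same four points grouped two ways (hypothetical data).
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(sum_squared_error(compact), sum_squared_error(mixed))
```

The partition that keeps nearby points together has the much smaller criterion value, which is the sense in which minimizing J_e favors compact clouds.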

What kinds of clustering problems are well suited to a sum-of-squared-error
criterion? Basically, J_e is an appropriate criterion when the clusters form compact clouds
that are rather well separated from one another. A less obvious problem arises when
there are great differences in the number of samples in different clusters. In that case
it can happen that a partition that splits a large cluster is favored over one that
maintains the integrity of the natural clusters, as illustrated in Fig. 10.9. This situation
frequently arises because of the presence of “outliers” or “wild shots,” and brings up
the problem of interpreting and evaluating the results of clustering. Since little can
be said about that problem, we shall merely observe that if additional considerations
render the results of minimizing J_e unsatisfactory, then these considerations should
be used, if possible, in formulating a better criterion function.


[Figure 10.9 panel labels: J_e = large (top), J_e = small (bottom)]


Figure 10.9: When two natural groupings have very different numbers of points, the 
clusters minimizing a sum-squared-error criterion (Eq. 49) may not reveal the true 
underlying structure. Here the criterion is smaller for the two clusters at the bottom
than for the more natural clustering at the top.


10.7.2 Related Minimum Variance Criteria 


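By some simple algebraic manipulation (Problem 19) we can eliminate the mean
vectors from the expression for J_e and obtain the equivalent expression


$$ J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i, \tag{50} $$


where


$$ \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} \|x - x'\|^2. \tag{51} $$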


Equation 51 leads us to interpret s̄_i as the average squared distance between points in
the ith cluster, and emphasizes the fact that the sum-of-squared-error criterion uses
Euclidean distance as the measure of similarity. It also suggests an obvious way of
obtaining other criterion functions. For example, one can replace s̄_i by the average,
the median, or perhaps the maximum distance between points in a cluster. More
generally, one can introduce an appropriate similarity function s(x, x') and replace s̄_i
by functions such as


$$ \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} s(x, x') \tag{52} $$


or


$$ \bar{s}_i = \min_{x, x' \in D_i} s(x, x'). \tag{53} $$

Table 10.1: Mean vectors and scatter matrices used in clustering criteria.

Quantity                               Depends on        Definition
                                       cluster center?
Mean vector for the ith cluster        Yes               m_i = \frac{1}{n_i} \sum_{x \in D_i} x                              (54)
Total mean vector                      No                m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i   (55)
Scatter matrix for the ith cluster     Yes               S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t                         (56)
Within-cluster scatter matrix          Yes               S_W = \sum_{i=1}^{c} S_i                                            (57)
Between-cluster scatter matrix         Yes               S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t                       (58)
Total scatter matrix                   No                S_T = \sum_{x \in D} (x - m)(x - m)^t                               (59)


As in Chap. ??, we define an optimal partition as one that extremizes the crite- 
rion function. This creates a well-defined problem, and the hope is that its solution 
discloses the intrinsic structure of the data. 
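
The equivalence of Eq. 49 and Eqs. 50-51 is easy to check numerically. The following sketch computes J_e both from the cluster means and from the pairwise squared distances; the function names and the small data set are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def je_from_means(clusters):
    """J_e as defined in Eq. 49, using the cluster means."""
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        m = [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]
        total += sum(sq_dist(x, m) for x in cluster)
    return total

def je_from_pairs(clusters):
    """J_e as in Eq. 50, with s_bar_i the average squared distance of Eq. 51."""
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        s_bar = sum(sq_dist(x, y) for x in cluster for y in cluster) / n ** 2
        total += 0.5 * n * s_bar
    return total

# Hypothetical partition of five two-dimensional samples.
clusters = [[(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)], [(5.0, 5.0), (7.0, 5.0)]]
print(je_from_means(clusters), je_from_pairs(clusters))
```

Both routes give the same value, so the mean vectors are a computational convenience rather than an essential part of the criterion.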


10.7.3 Scattering Criteria


The Scatter Matrices


Another interesting class of criterion functions can be derived from the scatter matri- 
ces used in multiple discriminant analysis. The following definitions directly parallel 
those given in Chapt. ??. 

As before, it follows from these definitions that the total scatter matrix is the sum 
of the within-cluster scatter matrix and the between-cluster scatter matrix: 


$$ S_T = S_W + S_B. \tag{60} $$


Note that the total scatter matrix does not depend on how the set of samples is par- 
titioned into clusters; it depends only on the total set of samples. The within-cluster 
and between-cluster scatter matrices taken separately do depend on the partitioning, 
of course. Roughly speaking, there is an exchange between these two matrices, the 
between-cluster scatter going up as the within-cluster scatter goes down. This is for- 
tunate, since by trying to minimize the within-cluster scatter we will also tend to 
maximize the between-cluster scatter. 
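
The decomposition of Eq. 60 can be verified directly from the definitions in Table 10.1. The sketch below is written in plain Python with matrices as nested lists; the helper names and the six sample points are illustrative assumptions.

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def scatter(points, center, weight=1.0):
    """Sum over the points of weight * (x - center)(x - center)^t."""
    d = len(center)
    S = [[0.0] * d for _ in range(d)]
    for x in points:
        diff = [a - b for a, b in zip(x, center)]
        for r in range(d):
            for s in range(d):
                S[r][s] += weight * diff[r] * diff[s]
    return S

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter_matrices(clusters):
    """Return (S_W, S_B, S_T) following Eqs. 56-59."""
    all_points = [x for cluster in clusters for x in cluster]
    m = mean(all_points)
    d = len(m)
    S_W = [[0.0] * d for _ in range(d)]
    S_B = [[0.0] * d for _ in range(d)]
    for cluster in clusters:
        m_i = mean(cluster)
        S_W = mat_add(S_W, scatter(cluster, m_i))                    # Eqs. 56-57
        S_B = mat_add(S_B, scatter([m_i], m, weight=len(cluster)))   # Eq. 58
    S_T = scatter(all_points, m)                                     # Eq. 59
    return S_W, S_B, S_T

clusters = [[(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)],
            [(4.0, 0.0), (5.0, 1.0), (4.0, 2.0)]]
S_W, S_B, S_T = scatter_matrices(clusters)
S_sum = mat_add(S_W, S_B)
print(S_T, S_sum)
```

Regrouping the same six points changes S_W and S_B but leaves S_T untouched, which is the exchange between the two matrices described above.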

To be more precise in talking about the amount of within-cluster or between- 
cluster scatter, we need a scalar measure of the “size” of a scatter matrix. The two 
measures that we shall consider are the trace and the determinant. In the univariate 
case, these two measures are equivalent, and we can define an optimal partition as one 
that minimizes S_W or maximizes S_B. In the multivariate case things are somewhat
more complicated, and a number of related but distinct optimality criteria have been 
suggested. 


The Trace Criterion 


Perhaps the simplest scalar measure of a scatter matrix is its trace — the sum of its 
diagonal elements. Roughly speaking, the trace measures the square of the scattering 
radius, since it is proportional to the sum of the variances in the coordinate directions. 
Thus, an obvious criterion function to minimize is the trace of S_W. In fact, this
criterion is nothing more or less than the sum-of-squared-error criterion, since the
definitions of scatter matrices (Eqs. 56 & 57) yield


$$ \mathrm{tr}\, S_W = \sum_{i=1}^{c} \mathrm{tr}\, S_i = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2 = J_e. \tag{61} $$


Since tr S_T = tr S_W + tr S_B and tr S_T is independent of how the samples are partitioned,
we see that no new results are obtained by trying to maximize tr S_B. However, it is
comforting to know that in seeking to minimize the within-cluster criterion J_e = tr S_W
we are also maximizing the between-cluster criterion


$$ \mathrm{tr}\, S_B = \sum_{i=1}^{c} n_i \|m_i - m\|^2. \tag{62} $$


The Determinant Criterion 


In Sect. ?? we used the determinant of the scatter matrix to obtain a scalar measure
of scatter. Roughly speaking, the determinant measures the square of the scattering
volume, since it is proportional to the product of the variances in the directions of
the principal axes. Since S_B will be singular if the number of clusters is less than or
equal to the dimensionality, |S_B| is obviously a poor choice for a criterion function.
Furthermore, S_W may become singular, and will certainly be so if n - c is less than
the dimensionality d (Problem 27). However, if we assume that S_W is nonsingular,
we are led to consider the determinant criterion function


$$ J_d = |S_W| = \left| \sum_{i=1}^{c} S_i \right|. \tag{63} $$


The partition that minimizes J_d is often similar to the one that minimizes J_e,
but the two need not be the same, as shown in Example 3. We observed before that
the minimum-squared-error partition might change if the axes are scaled, though this
does not happen with J_d (Problem 26). Thus J_d is to be favored under conditions
where there may be unknown or irrelevant linear transformations of the data.
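
The different behavior of the trace and determinant criteria under axis scaling can be seen on a small example. The sketch below is restricted to two dimensions so that the determinant can be written out by hand; the two candidate partitions and the scale factor of 0.1 are hypothetical choices, not data from the text.

```python
def within_scatter_2d(clusters):
    """Entries (s_xx, s_xy, s_yy) of the 2x2 within-cluster scatter matrix S_W."""
    sxx = sxy = syy = 0.0
    for cluster in clusters:
        mx = sum(x for x, _ in cluster) / len(cluster)
        my = sum(y for _, y in cluster) / len(cluster)
        for x, y in cluster:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    return sxx, sxy, syy

def j_e(clusters):
    """Trace criterion: J_e = tr S_W (Eq. 61)."""
    sxx, _, syy = within_scatter_2d(clusters)
    return sxx + syy

def j_d(clusters):
    """Determinant criterion: J_d = |S_W| (Eq. 63)."""
    sxx, sxy, syy = within_scatter_2d(clusters)
    return sxx * syy - sxy ** 2

def rescale(clusters, ax, ay):
    """Scale the horizontal axis by ax and the vertical axis by ay."""
    return [[(ax * x, ay * y) for x, y in cluster] for cluster in clusters]

# Two candidate partitions of the same six points.
left_right = [[(0, 0), (1, 1), (0, 2)], [(4, 0), (5, 1), (4, 2)]]
bottom_top = [[(0, 0), (1, 1), (4, 0)], [(0, 2), (5, 1), (4, 2)]]

for name, ax in (("original", 1.0), ("x axis shrunk", 0.1)):
    a, b = rescale(left_right, ax, 1.0), rescale(bottom_top, ax, 1.0)
    print(name, "J_e:", j_e(a), j_e(b), "J_d:", j_d(a), j_d(b))
```

On these points J_e prefers the left/right split before the scaling and the bottom/top split after it, whereas J_d prefers the left/right split in both cases, because scaling multiplies |S_W| by the same factor for every partition.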


Invariant Criteria 


It is not particularly hard to show that the eigenvalues λ_1, ..., λ_d of S_W^{-1} S_B are
invariant under nonsingular linear transformations of the data (Problem ??). Indeed, these
eigenvalues are the basic linear invariants of the scatter matrices. Their numerical
values measure the ratio of between-cluster to within-cluster scatter in the direction
of the eigenvectors, and partitions that yield large values are usually desirable. Of
course, as we pointed out in Sect. ??, the fact that the rank of S_B cannot exceed
c - 1 means that no more than c - 1 of these eigenvalues can be nonzero. Nevertheless,
good partitions are ones for which the nonzero eigenvalues are large.

One can invent a great variety of invariant clustering criteria by composing
appropriate functions of these eigenvalues. Some of these follow naturally from standard
matrix operations. For example, since the trace of a matrix is the sum of its
eigenvalues, one might elect to maximize the criterion function


$$ \mathrm{tr}\, S_W^{-1} S_B = \sum_{i=1}^{d} \lambda_i. \tag{64} $$


By using the relation S_T = S_W + S_B, one can derive the following invariant relatives
of tr S_W and |S_W| (Problem 25):


$$ J_f = \mathrm{tr}\, S_T^{-1} S_W = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i} \tag{65} $$


and


$$ \frac{|S_W|}{|S_T|} = \prod_{i=1}^{d} \frac{1}{1 + \lambda_i}. \tag{66} $$
Since all of these criterion functions are invariant to linear transformations, the 
same is true of the partitions that extremize them. In the special case of two clusters, 
only one eigenvalue is nonzero, and all of these criteria yield the same clustering. 
However, when the samples are partitioned into more than two clusters, the optimal 
partitions, though often similar, need not be the same, as shown in Example 3. 
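
The relations of Eqs. 64-66 can be checked numerically in two dimensions, where the eigenvalues of a 2x2 matrix follow from its trace and determinant. This is only a sketch under that restriction; the helper names and the three hypothetical clusters are not from the text.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def scatter(points, center, weight=1.0):
    """2x2 matrix: sum of weight * (x - center)(x - center)^t."""
    cx, cy = center
    sxx = sum(weight * (x - cx) ** 2 for x, y in points)
    sxy = sum(weight * (x - cx) * (y - cy) for x, y in points)
    syy = sum(weight * (y - cy) ** 2 for x, y in points)
    return [[sxx, sxy], [sxy, syy]]

def add(A, B):
    return [[A[r][s] + B[r][s] for s in range(2)] for r in range(2)]

def mul(A, B):
    return [[sum(A[r][k] * B[k][s] for k in range(2)) for s in range(2)] for r in range(2)]

def det(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv(A):
    D = det(A)
    return [[A[1][1] / D, -A[0][1] / D], [-A[1][0] / D, A[0][0] / D]]

def trace(A):
    return A[0][0] + A[1][1]

def eigenvalues(A):
    """Roots of the characteristic polynomial of a 2x2 matrix with real eigenvalues."""
    t, D = trace(A), det(A)
    root = math.sqrt(max(t * t - 4 * D, 0.0))
    return ((t + root) / 2, (t - root) / 2)

clusters = [[(0.0, 0.0), (1.0, 0.5), (0.5, 1.5)],
            [(4.0, 0.0), (5.0, 1.0), (4.5, 2.5)],
            [(2.0, 5.0), (3.0, 6.0), (2.5, 7.5)]]
m = centroid([p for c in clusters for p in c])
S_W = [[0.0, 0.0], [0.0, 0.0]]
S_B = [[0.0, 0.0], [0.0, 0.0]]
for c in clusters:
    S_W = add(S_W, scatter(c, centroid(c)))
    S_B = add(S_B, scatter([centroid(c)], m, weight=len(c)))
S_T = add(S_W, S_B)

lams = eigenvalues(mul(inv(S_W), S_B))
J_f = trace(mul(inv(S_T), S_W))      # left side of Eq. 65
ratio = det(S_W) / det(S_T)          # left side of Eq. 66
print(lams, J_f, ratio)
```

With three clusters in two dimensions both eigenvalues can be nonzero, so the criteria of Eqs. 64-66 are genuinely different functions of the partition.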


Example 3: Clustering criteria


We can gain some intuition by considering these criteria applied to the following
data set.


sample    x_1     x_2        sample    x_1     x_2
   1     -1.82    0.24         11      0.41    0.91
   2     -0.38   -0.39         12      1.70    0.48
   3     -0.13    0.16         13      0.92   -0.49
   4     -1.17    0.44         14      2.41    0.32
   5     -0.92    0.16         15      1.48   -0.23
   6     -1.69   -0.01         16     -0.34    1.88
   7      0.33   -0.17         17      0.83    0.23
   8     -0.71   -0.21         18      0.62    0.81
   9      1.27   -0.39         19     -1.42   -0.51
  10     -0.16   -0.23         20      0.67   -0.55


The clusters found by minimizing a criterion depend upon the criterion function
as well as the assumed number of clusters. The sum-of-squared-error criterion J_e
(Eq. 49), the determinant criterion J_d (Eq. 63) and the more subtle trace criterion
J_f (Eq. 65) were applied to the 20 points in the table with the assumption of c = 2
and c = 3 clusters. (Each point in the table is shown, with bounding boxes defined
by -1.8 < x < 2.5 and -0.6 < y < 1.9.)


All of the clusterings seem reasonable, and there is no strong argument to favor one
over the others. For the case c = 2, the clusters minimizing J_e indeed tend to favor
clusters of roughly equal numbers of points, as illustrated in Fig. 10.9; in contrast,
J_d favors one large and one fairly small cluster. Since the full data set happens to
be spread horizontally more than vertically, the eigenvalue in the horizontal direction
is greater than that in the vertical direction. As such, the clusters are “stretched”
horizontally somewhat. In general, the differences between the cluster criteria become
less pronounced for large numbers of clusters. For the c = 3 case, for instance, the
clusters depend only mildly upon the cluster criterion; indeed, two of the clusterings
are identical.


With regard to the criterion function involving S_T, note that S_T does not depend
on how the samples are partitioned into clusters. Thus, the clusterings that minimize
|S_W|/|S_T| are exactly the same as the ones that minimize |S_W|. If we rotate and scale
the axes so that S_T becomes the identity matrix, we see that minimizing tr[S_T^{-1} S_W]
is equivalent to minimizing the sum-of-squared-error criterion tr S_W after performing
this normalization. Clearly, this criterion suffers from the very defects that we warned
about in Sect. ??, and it is probably the least desirable of these criteria.

One final warning about invariant criteria is in order. If different apparent clusters 
can be obtained by scaling the axes or by applying any other linear transformation, 
then all of these groupings will be exposed by invariant procedures. Thus, invariant 
criterion functions are more likely to possess multiple local extrema, and are corre- 
spondingly more difficult to optimize. 

The variety of the criterion functions we have discussed and the somewhat subtle 
differences between them should not be allowed to obscure their essential similarity. In 
every case the underlying model is that the samples form c fairly well separated clouds 
of points. The within-cluster scatter matrix Sw is used to measure the compactness 
of these clouds, and the basic goal is to find the most compact grouping. While this 
approach has proved useful for many problems, it is not universally applicable. For 
example, it will not extract a very dense cluster embedded in the center of a diffuse 
cluster, or separate intertwined line-like clusters. For such cases one must devise other 
criterion functions that are better matched to the structure present or being sought.


10.8 *Iterative Optimization 


Once a criterion function has been selected, clustering becomes a well-defined problem 
in discrete optimization: find those partitions of the set of samples that extremize the 
criterion function. Since the sample set is finite, there are only a finite number of 
possible partitions. Thus, in theory the clustering problem can always be solved 
by exhaustive enumeration. However, the computational complexity renders such an 
approach unthinkable for all but the simplest problems; there are approximately c^n/c!
ways of partitioning a set of n elements into c subsets, and this exponential growth
with n is overwhelming (Problem 17). For example, an exhaustive search for the best
set of 5 clusters in 100 samples would require considering more than 10^67 partitionings.
Simply put, in most applications an exhaustive search is completely infeasible. 
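
The size of the search space can be computed exactly. The number of ways to partition n elements into c nonempty subsets is the Stirling number of the second kind; the sketch below uses its standard inclusion-exclusion formula, which is an addition here and is not derived in the text, and compares the exact count with the approximation c^n/c!.

```python
import math

def stirling2(n, c):
    """Number of partitions of n elements into exactly c nonempty subsets."""
    total = sum((-1) ** (c - i) * math.comb(c, i) * i ** n for i in range(c + 1))
    return total // math.factorial(c)   # the sum is always divisible by c!

exact = stirling2(100, 5)
approx = 5 ** 100 // math.factorial(5)

print(f"exact  = {exact:.3e}")
print(f"approx = {approx:.3e}")
```

For n = 100 and c = 5 the two values agree to several significant figures and both exceed 10^67, so enumeration is out of the question even for this modest problem.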

The approach most frequently used in seeking optimal partitions is iterative op- 
timization. The basic idea is to find some reasonable initial partition and to “move” 
samples from one group to another if such a move will improve the value of the cri- 
terion function. Like hill-climbing procedures in general, these approaches guarantee 
local but not global optimization. Different starting points can lead to different solu- 
tions, and one never knows whether or not the best solution has been found. Despite 
these limitations, the fact that the computational requirements are bearable makes 
this approach attractive. 

Let us consider the use of iterative improvement to minimize the
sum-of-squared-error criterion J_e, written as


$$ J_e = \sum_{i=1}^{c} J_i, \tag{67} $$


where an effective error per cluster is defined to be


$$ J_i = \sum_{x \in D_i} \|x - m_i\|^2 \tag{68} $$


and the mean of each cluster is, as before,


$$ m_i = \frac{1}{n_i} \sum_{x \in D_i} x. \tag{48} $$


Suppose that a sample x̂ currently in cluster D_i is tentatively moved to D_j. Then m_j
changes to


$$ m_j^* = m_j + \frac{\hat{x} - m_j}{n_j + 1} \tag{69} $$


and J_j increases to


$$ \begin{aligned}
J_j^* &= \sum_{x \in D_j} \|x - m_j^*\|^2 + \|\hat{x} - m_j^*\|^2 \\
      &= \sum_{x \in D_j} \left\| x - m_j - \frac{\hat{x} - m_j}{n_j + 1} \right\|^2 + \left\| \frac{n_j}{n_j + 1} (\hat{x} - m_j) \right\|^2 \\
      &= J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2.
\end{aligned} \tag{70} $$


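Under the assumption that n_i ≠ 1 (singleton clusters should not be destroyed), a
similar calculation (Problem 29) shows that m_i changes to


$$ m_i^* = m_i - \frac{\hat{x} - m_i}{n_i - 1} \tag{71} $$


and J_i decreases to


$$ J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2. \tag{72} $$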
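
The update formulas of Eqs. 69-72 can be checked against a direct recomputation of the means and errors. The sketch below uses two small hypothetical clusters and moves one sample from D_i to D_j; all names and data are illustrative.

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def cluster_error(points):
    """J_i of Eq. 68 computed from scratch."""
    m = mean(points)
    return sum(sq_dist(x, m) for x in points)

D_i = [(4.0, 3.0), (5.0, 3.0), (6.0, 3.0)]   # source cluster, contains x_hat
D_j = [(0.0, 0.0), (2.0, 0.0)]               # destination cluster
x_hat = D_i[0]
n_i, n_j = len(D_i), len(D_j)
m_i, m_j = mean(D_i), mean(D_j)

# Incremental updates, using only the old means, counts and errors.
m_j_new = [m + (x - m) / (n_j + 1) for m, x in zip(m_j, x_hat)]           # Eq. 69
J_j_new = cluster_error(D_j) + n_j / (n_j + 1) * sq_dist(x_hat, m_j)      # Eq. 70
m_i_new = [m - (x - m) / (n_i - 1) for m, x in zip(m_i, x_hat)]           # Eq. 71
J_i_new = cluster_error(D_i) - n_i / (n_i - 1) * sq_dist(x_hat, m_i)      # Eq. 72

print(J_j_new, cluster_error(D_j + [x_hat]))
print(J_i_new, cluster_error(D_i[1:]))
```

The incremental values match those obtained by rebuilding each cluster, which is what allows the transfer test to be carried out without revisiting every sample.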


These equations greatly simplify the computation of the change in the criterion
function. The transfer of x̂ from D_i to D_j is advantageous if the decrease in J_i is
greater than the increase in J_j. This is the case if


$$ \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 > \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2, \tag{73} $$


which typically happens whenever x̂ is closer to m_j than to m_i. If reassignment is
profitable, the greatest decrease in sum of squared error is obtained by selecting the
cluster for which n_j/(n_j + 1) ||x̂ - m_j||^2 is minimum. This leads to the following
clustering procedure:


Algorithm 3 (Basic iterative minimum-squared-error clustering)


begin initialize n, c, m_1, m_2, ..., m_c
    do randomly select a sample x̂
        i ← arg min_{i'} ||m_{i'} - x̂||    (classify x̂)
        if n_i ≠ 1 then compute
            ρ_j = n_j/(n_j + 1) ||x̂ - m_j||^2    for j ≠ i
            ρ_j = n_j/(n_j - 1) ||x̂ - m_i||^2    for j = i
        if ρ_k ≤ ρ_j for all j then transfer x̂ to D_k
        recompute J_e, m_i, m_k
    until no change in J_e in n attempts
    return m_1, m_2, ..., m_c
end

A moment’s consideration will show that this procedure is essentially a sequential
version of the k-means procedure (Algorithm 1) described in Sect. 10.4.3. Where
the k-means procedure waits until all n samples have been reclassified before updat- 
ing, the Basic Iterative Minimum-Squared-Error procedure updates after each sample 
is reclassified. It has been experimentally observed that this procedure is more suscep- 
tible to being trapped in local minima, and it has the further disadvantage of making 
the results depend on the order in which the candidates are selected. However, it is at 
least a stepwise optimal procedure, and it can be easily modified to apply to problems 
in which samples are acquired sequentially and clustering must be done on-line. 

One question that plagues all hill-climbing procedures is the choice of the starting 
point. Unfortunately, there is no simple, universally good solution to this problem. 
One approach is to select c samples randomly for the initial cluster centers, using 
them to partition the data on a minimum-distance basis. Repetition with different 
random selections can give some indication of the sensitivity of the solution to the 
starting point. Yet another approach is to find the c-cluster starting point from the
solutions to the (c - 1)-cluster problem. The solution for the one-cluster problem is
the total sample mean; the starting point for the c-cluster problem can be the final
means for the (c - 1)-cluster problem plus the sample that is farthest from the nearest
cluster center. This approach leads us directly to the so-called hierarchical clustering 
procedures, which are simple methods that can provide very good starting points for 
iterative optimization. 


10.9 Hierarchical Clustering 


Up to now, our methods have formed disjoint clusters; in computer science terminology,
we would say that the data description is “flat.” However, there are many times
when clusters have subclusters, these have sub-subclusters, and so on. In biological
taxonomy, for instance, kingdoms are split into phyla, which are split into subphyla,
which are split into classes, subclasses, orders, suborders, families, subfamilies,
genera and species, and so on, all the way to a particular individual organism. Thus
we might have kingdom = animal, phylum = Chordata, subphylum = Vertebrata,
class = Osteichthyes, subclass = Actinopterygii, order = Salmoniformes, family =
Salmonidae, genus = Oncorhynchus, species = Oncorhynchus kisutch, and individual
= the particular Coho salmon caught in my net. Organisms that lie in the animal
kingdom (such as a salmon and a moose) share important attributes that are not
present in organisms in the plant kingdom, such as redwood trees. In fact, this kind of
hierarchical clustering permeates classificatory activities in the sciences. Thus we now
turn to clustering methods which will lead to representations that are “hierarchical,”
rather than flat.


10.9.1 Definitions 


Let us consider a sequence of partitions of the n samples into c clusters. The first of 
these is a partition into n clusters, each cluster containing exactly one sample. The 
next is a partition into n — 1 clusters, the next a partition into n — 2, and so on until 
the nth, in which all the samples form one cluster. We shall say that we are at level 
k in the sequence when c = n — k + 1. Thus, level one corresponds to n clusters and 
level n to one cluster. Given any two samples x and x’, at some level they will be 
grouped together in the same cluster. If the sequence has the property that whenever 
two samples are in the same cluster at level k they remain together at all higher levels, 
then the sequence is said to be a hierarchical clustering. 
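
The definition can be turned into a small check. The sketch below assumes each level is given as a list of Python sets and that every level is already a valid partition of the same samples; the function name and the example sequences are illustrative.

```python
def is_hierarchical(levels):
    """levels[k - 1] is the partition at level k, given as a list of sets.

    Checks that level k has n - k + 1 clusters and that every cluster at one
    level is contained in a single cluster at the next level.
    """
    n = len(levels)
    for k, partition in enumerate(levels, start=1):
        if len(partition) != n - k + 1:
            return False
    for finer, coarser in zip(levels, levels[1:]):
        for cluster in finer:
            if not any(cluster <= merged for merged in coarser):
                return False
    return True

nested = [[{1}, {2}, {3}, {4}], [{1, 2}, {3}, {4}], [{1, 2}, {3, 4}], [{1, 2, 3, 4}]]
broken = [[{1}, {2}, {3}, {4}], [{1, 2}, {3}, {4}], [{1, 3}, {2, 4}], [{1, 2, 3, 4}]]

print(is_hierarchical(nested), is_hierarchical(broken))
```

In the second sequence samples 1 and 2 are together at level 2 but separated at level 3, so it is a sequence of partitions without being a hierarchical clustering.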

The most natural representation of hierarchical clustering is a corresponding tree, 
called a dendrogram, which shows how the samples are grouped. Figure 10.10 shows 
a dendrogram for a simple problem involving eight samples. Level 1 shows the eight 
samples as singleton clusters. At level 2, samples x_6 and x_7 have been grouped to
form a cluster, and they stay together at all subsequent levels. If it is possible to 
measure the similarity between clusters, then the dendrogram is usually drawn to 
scale to show the similarity between the clusters that are grouped. In Fig. 10.10, for 
example, the similarity between the two groups of samples that are merged at level 5 
has a value of roughly 60. 

We shall see shortly how such similarity values can be obtained, but first note that 
the similarity values can be used to help determine whether groupings are natural or 
forced. If the similarity values for the levels are roughly evenly distributed throughout 
the range of possible values, then there is no principled argument that any particular
number of clusters is better or “more natural” than another. Conversely, suppose that
there is an unusually large gap between the similarity values for the levels corresponding
to c = 3 and to c = 4 clusters. In such a case, one can argue that c = 3 is the most 
natural number of clusters (Problem 35). 

Figure 10.10: A dendrogram can represent the results of hierarchical clustering
algorithms. The vertical axis shows a generalized measure of similarity among clusters.
Here, at level 1 all eight points lie in singleton clusters; each point in a cluster is
highly similar to itself, of course. Points x_6 and x_7 happen to be the most similar,
and are merged at level 2, and so forth.


Another representation for hierarchical clustering is based on sets, in which each
level of cluster may contain sets that are subclusters, as shown in Fig. 10.11. Yet another,
textual, representation uses brackets, such as: {{x_1, {x_2, x_3}}, {{{x_4, x_5}, {x_6, x_7}}, x_8}}.
While such representations may reveal the hierarchical structure of the data, they do 
not naturally represent the similarities quantitatively. For this reason dendrograms 
are generally preferred. 


Figure 10.11: A set or Venn diagram representation of two-dimensional data (which 
was used in the dendrogram of Fig. 10.10) reveals the hierarchical structure but not 
the quantitative distances between clusters. The levels are numbered in red. 


Because of their conceptual simplicity, hierarchical clustering procedures are among
the best-known of unsupervised methods. The procedures themselves can be divided
according to two distinct approaches: agglomerative and divisive. Agglomerative
(bottom-up, clumping) procedures start with n singleton clusters and form the
sequence by successively merging clusters. Divisive (top-down, splitting) procedures
start with all of the samples in one cluster and form the sequence by successively
splitting clusters. The computation needed to go from one level to another is usually
simpler for the agglomerative procedures. However, when there are many samples
and one is interested in only a small number of clusters, this computation will have
to be repeated many times. For simplicity, we shall concentrate on agglomerative
procedures, and merely touch on some divisive methods in Sect. 10.12.


10.9.2 Agglomerative Hierarchical Clustering 


The major steps in agglomerative clustering are contained in the following procedure, 
where c is the desired number of final clusters: 


Algorithm 4 (Agglomerative hierarchical clustering)


1 begin initialize c, ĉ ← n, D_i ← {x_i}, i = 1, ..., n
2     do ĉ ← ĉ - 1
3         Find nearest clusters, say, D_i and D_j
4         Merge D_i and D_j
5     until c = ĉ
6     return c clusters
7 end


As described, this procedure terminates when the specified number of clusters has been
obtained and returns the clusters, described as sets of points (rather than as mean or
representative vectors). If we continue until c = 1 we can produce a dendrogram like
that in Fig. 10.10. At any level the “distance” between nearest clusters can provide 
the dissimilarity value for that level. Note that we have not said how to measure the 
distance between two clusters, and hence how to find the “nearest” clusters, required 
by line 3 of the Algorithm. The considerations here are much like those involved 
in selecting a general clustering criterion function. For simplicity, we shall generally 
restrict our attention to the following distance measures: 


$$ d_{min}(D_i, D_j) = \min_{x \in D_i,\; x' \in D_j} \|x - x'\| \tag{74} $$

$$ d_{max}(D_i, D_j) = \max_{x \in D_i,\; x' \in D_j} \|x - x'\| \tag{75} $$

$$ d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\| \tag{76} $$

$$ d_{mean}(D_i, D_j) = \|m_i - m_j\|. \tag{77} $$


All of these measures have a minimum-variance flavor, and they usually yield the same 
results if the clusters are compact and well separated. However, if the clusters are 
close to one another, or if their shapes are not basically hyperspherical, quite different 
results can be obtained. Below we shall illustrate some of the differences. 
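
The four distance measures and the merging loop of Algorithm 4 can be sketched as follows. This is a deliberately naive version that recomputes cluster distances at every step instead of using an inter-point distance table; the function names and the one-dimensional test data are assumptions made for illustration.

```python
import math

def d_min(A, B):
    """Eq. 74: distance between the closest pair of points."""
    return min(math.dist(x, y) for x in A for y in B)

def d_max(A, B):
    """Eq. 75: distance between the most distant pair of points."""
    return max(math.dist(x, y) for x in A for y in B)

def d_avg(A, B):
    """Eq. 76: average distance over all cross-cluster pairs."""
    return sum(math.dist(x, y) for x in A for y in B) / (len(A) * len(B))

def d_mean(A, B):
    """Eq. 77: distance between the cluster means."""
    def centroid(P):
        return [sum(p[k] for p in P) / len(P) for k in range(len(P[0]))]
    return math.dist(centroid(A), centroid(B))

def agglomerate(samples, c, cluster_distance=d_min):
    """Algorithm 4: merge the two nearest clusters until c clusters remain."""
    clusters = [[x] for x in samples]
    while len(clusters) > c:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters[j])   # j > i, so deleting j leaves index i valid
        del clusters[j]
    return clusters

# Points on a line with gradually increasing gaps (hypothetical data).
samples = [(0.0,), (1.0,), (2.5,), (4.5,), (6.8,)]
print(agglomerate(samples, 2, d_min))
print(agglomerate(samples, 2, d_max))
```

On this chain of points d_min keeps absorbing the next neighbor and leaves only the last point on its own, while d_max produces two groups of more similar extent, a small instance of the differences discussed in the following subsections.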

But first let us consider the computational complexity of a particularly simple 
agglomerative clustering algorithm. Suppose we have n patterns in d-dimensional 
space, and we seek to form c clusters using d_min(D_i, D_j) defined in Eq. 74. We
will, once and for all, need to calculate n(n - 1) inter-point distances, each of
which is an O(d^2) calculation, and place the results in an inter-point distance
table. The space complexity is, then, O(n^2). Finding the minimum distance pair
(for the first merging) requires that we step through the complete list, keeping the
index of the smallest distance. Thus for the first agglomerative step, the complexity
is O(n(n - 1)(d^2 + 1)) = O(n^2 d^2). For an arbitrary agglomeration step (i.e., from ĉ
to ĉ - 1), we need merely step through the n(n - 1) - ĉ “unused” distances in the
list and find the smallest for which x and x' lie in different clusters. This is, again,
O(n(n - 1) - ĉ). The full time complexity is thus O(cn^2 d^2), and in typical conditions
n ≫ c.*


The Nearest-Neighbor Algorithm 


When d_min is used to measure the distance between clusters (Eq. 74) the algorithm
is sometimes called the nearest-neighbor cluster algorithm, or minimum algorithm.
Moreover, if it is terminated when the distance between nearest clusters exceeds an
arbitrary threshold, it is called the single-linkage algorithm. Suppose that we think
of the data points as being nodes of a graph, with edges forming a path between the
nodes in the same subset D_i. When d_min is used to measure the distance between
subsets, the nearest neighbor nodes determine the nearest subsets. The merging of
D_i and D_j corresponds to adding an edge between the nearest pair of nodes in D_i
and D_j. Since edges linking clusters always go between distinct clusters, the resulting
graph never has any closed loops or circuits; in the terminology of graph theory, this
procedure generates a tree. If it is allowed to continue until all of the subsets are
linked, the result is a spanning tree, that is, a tree with a path from any node to any
other node. Moreover, it can be shown that the sum of the edge lengths of the resulting
tree will not exceed the sum of the edge lengths for any other spanning tree for that
set of samples (Problem 37). Thus, with the use of d_min as the distance measure, the
agglomerative clustering procedure becomes an algorithm for generating a minimal
spanning tree.
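
The connection with minimal spanning trees can be checked numerically. The sketch below runs the nearest-neighbor merging down to a single cluster, records the edge added at each merge, and compares the total edge length with that of a spanning tree grown by Prim's algorithm, which is brought in here only as an independent reference. The point set and function names are hypothetical.

```python
import math

def single_linkage_edges(points):
    """Merge nearest clusters under d_min, recording the linking edge each time."""
    clusters = [[i] for i in range(len(points))]
    edges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                for i in clusters[a]:
                    for j in clusters[b]:
                        d = math.dist(points[i], points[j])
                        if best is None or d < best[0]:
                            best = (d, i, j, a, b)
        d, i, j, a, b = best
        edges.append((i, j, d))
        clusters[a] += clusters[b]   # b > a, so index a stays valid after deletion
        del clusters[b]
    return edges

def prim_mst_weight(points):
    """Total edge length of a minimal spanning tree, grown from node 0."""
    n = len(points)
    in_tree = {0}
    total = 0.0
    while len(in_tree) < n:
        d, j = min((math.dist(points[i], points[j]), j)
                   for i in in_tree for j in range(n) if j not in in_tree)
        total += d
        in_tree.add(j)
    return total

points = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.1), (4.0, 4.0), (4.5, 3.2), (5.1, 4.4), (2.0, 2.5)]
edges = single_linkage_edges(points)
print(sum(d for _, _, d in edges), prim_mst_weight(points))
```

The merging procedure adds exactly n - 1 edges, and their total length agrees with the reference tree, as the argument above predicts.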

Figure 10.12 shows the results of applying this procedure to Gaussian data. In 
both cases the procedure was stopped giving two large clusters (plus three singleton 
outliers); a minimal spanning tree can be obtained by adding the shortest possible edge 
between the two clusters. In the first case where the clusters are fairly well separated, 
the obvious clusters are found. In the second case, the presence of a point located so 
as to produce a bridge between the clusters results in a rather unexpected grouping 
into one large, elongated cluster, and one small, compact cluster. This behavior is 
often called the “chaining effect,” and is sometimes considered to be a defect of this 
distance measure. To the extent that the results are very sensitive to noise or to slight 
changes in position of the data points, this is certainly a valid criticism. 


The Farthest-Neighbor Algorithm 


When dmax (Eq. 75) is used to measure the distance between subsets, the algorithm is 
sometimes called the farthest-neighbor clustering algorithm, or maximum algorithm. 
If it is terminated when the distance between nearest clusters exceeds an arbitrary 
threshold, it is called the complete-linkage algorithm. The farthest-neighbor algorithm 
discourages the growth of elongated clusters. Application of the procedure can be 
thought of as producing a graph in which edges connect all of the nodes in a cluster. 
In the terminology of graph theory, every cluster constitutes a complete subgraph. 
The distance between two clusters is determined by the most distant nodes in the two 


* There are methods for sorting or arranging the entries in the inter-point distance table so as 
to easily avoid inspection of points in the same cluster, but these typically do not improve the 
complexity results significantly. 
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Figure 10.12: Two Gaussians were used to generate two-dimensional samples, shown 
in pink and black. The nearest-neighbor clustering algorithm gives two clusters that 
well approximate the generating Gaussians (left). If, however, another particular 
sample is generated (red point at the right) and the procedure re-started, the clusters 
do not well approximate the Gaussians. This illustrates how the algorithm is sensitive 
to the details of the samples. 


clusters. When the nearest clusters are merged, the graph is changed by adding edges 
between every pair of nodes in the two clusters. 

If we define the diameter of a partition as the largest diameter for clusters in 
the partition, then each iteration increases the diameter of the partition as little 
as possible. As Fig. 10.13 illustrates, this is advantageous when the true clusters 
are compact and roughly equal in size. Nevertheless, when this is not the case — as 
happens with the two elongated clusters — the resulting groupings can be meaningless. 
This is another example of imposing structure on data rather than finding structure 
in it. 


Compromises 


The minimum and maximum measures represent two extremes in measuring the dis- 
tance between clusters. Like all procedures that involve minima or maxima, they 
tend to be overly sensitive to “outliers” or “wildshots.” The use of averaging is an 
obvious way to ameliorate these problems, and davg and dmean (Eqs. 76 & 77) are 
natural compromises between dmin and dmax. Computationally, dmean is the simplest 
of all of these measures, since the others require computing all n_i n_j pairs of distances 
||x − x′||. However, a measure such as davg can be used when the distances ||x − x′|| 
are replaced by similarity measures, where the similarity between mean vectors may 
be difficult or impossible to define. 


10.9.3 Stepwise-Optimal Hierarchical Clustering 


We observed earlier that if clusters are grown by merging the nearest pair of clus- 
ters, then the results have a minimum variance flavor. However, when the measure 
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Figure 10.13: The farthest-neighbor clustering algorithm uses the separation between 
the most distant points as a criterion for cluster membership. If this distance is set 
very large, then all points lie in the same cluster. In the case shown at the left, a 
fairly large dmax leads to three clusters; a smaller dmax gives four clusters, as shown 
at the right. 


of distance between clusters is chosen arbitrarily, one can rarely assert that the re- 
sulting partition extremizes any particular criterion function. In effect, hierarchical 
clustering defines a cluster as whatever results from applying the clustering procedure. 
Nevertheless, with a simple modification it is possible to obtain a stepwise-optimal 
procedure for extremizing a criterion function. This is done merely by replacing line 3 
of the Basic Iterative Agglomerative Clustering Procedure (Algorithm 4) by a more 
general form to get: 


Algorithm 5 (Stepwise optimal hierarchical clustering) 


1 begin initialize c, ĉ ← n, D_i ← {x_i}, i = 1, ..., n
2    do ĉ ← ĉ − 1
3       Find clusters whose merger changes the criterion the least, say, D_i and D_j
4       Merge D_i and D_j
5    until c = ĉ
6 return c clusters
7 end


We saw earlier that the use of dmax causes the smallest possible stepwise increase 
in the diameter of the partition. Another simple example is provided by the sum- 
of-squared-error criterion function Je. By an analysis very similar to that used in 
Sect. ??, we find that the pair of clusters whose merger increases Je as little as 
possible is the pair for which the “distance” 

d_e(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, \| m_i - m_j \|     (78)

is minimum (Problem 34). Thus, in selecting clusters to be merged, this criterion takes 
into account the number of samples in each cluster as well as the distance between 
clusters. In general, the use of de tends to favor growth by merging singletons or 
small clusters with large clusters over merging medium-sized clusters. While the final 
partition may not minimize Je, it usually provides a very good starting point for 
further iterative optimization. 
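
As a rough illustration of Algorithm 5 with the distance of Eq. 78, the following Python sketch merges clusters until c remain. The helper names (`d_e`, `mean`, `stepwise_optimal`) and the representation of samples as tuples are assumptions made for this example; the pairwise search is deliberately naive and is not meant as an efficient implementation.

```python
import math

def d_e(n_i, m_i, n_j, m_j):
    """Eq. 78: sqrt(n_i n_j / (n_i + n_j)) * ||m_i - m_j||."""
    return math.sqrt(n_i * n_j / (n_i + n_j)) * math.dist(m_i, m_j)

def mean(cluster):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def stepwise_optimal(points, c):
    """Merge clusters until c remain, always choosing the pair with the smallest d_e."""
    clusters = [[p] for p in points]          # start with n singleton clusters
    while len(clusters) > c:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: d_e(
            len(clusters[ab[0]]), mean(clusters[ab[0]]),
            len(clusters[ab[1]]), mean(clusters[ab[1]])))
        clusters[a] += clusters[b]            # merge D_j into D_i
        del clusters[b]
    return clusters
```

The square of `d_e` is the increase in Je caused by the merge: joining the singletons (0, 0) and (0, 2) raises Je from 0 to 2, and `d_e(1, (0, 0), 1, (0, 2)) ** 2` evaluates to 2 up to rounding.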
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10.9.4 Hierarchical Clustering and Induced Metrics 


Suppose that we are unable to supply a metric for our data, but that we can measure a 
dissimilarity value δ(x, x′) for every pair of samples, where δ(x, x′) ≥ 0, with equality 
holding if and only if x = x′. Then agglomerative clustering can still be used, with the 
understanding that the nearest pair of clusters is the least dissimilar pair. Interestingly 
enough, if we define the dissimilarity between two clusters by 

\delta_{min}(D_i, D_j) = \min_{x \in D_i, \, x' \in D_j} \delta(x, x')     (79)

or

\delta_{max}(D_i, D_j) = \max_{x \in D_i, \, x' \in D_j} \delta(x, x')     (80)


then the hierarchical clustering procedure will induce a distance function for the given 
set of n samples. Furthermore, the ranking of the distances between samples will be 
invariant to any monotonic transformation of the dissimilarity values (Problem 18). 

We can now define a generalized distance d(x, x′) between x and x′ as the value 
of the lowest level clustering for which x and x′ are in the same cluster. To show that 
this is a legitimate distance function, or metric, we need to show four things: for all 
vectors x, x′ and x″ 

non-negativity: d(x, x′) ≥ 0 
reflexivity: d(x, x′) = 0 if and only if x = x′ 
symmetry: d(x, x′) = d(x′, x) 
triangle inequality: d(x, x′) + d(x′, x″) ≥ d(x, x″). 

It is easy to see that these requirements are satisfied and hence that dissimilarity can 
induce a metric. For our formula for dissimilarity, we have moreover that 

d(x, x″) ≤ max[d(x, x′), d(x′, x″)] for any x′     (81)


in which case we say that d(-,-) is an ultrametric (Problem 31). Ultrametric criteria 
can be more immune to local minima problems since stricter ordering of distances 
among clusters is maintained. 
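
The induced distance can be made concrete with a small Python sketch. It assumes the cluster dissimilarity of Eq. 79 (the minimum over pairs) and takes a symmetric dissimilarity table as a list of lists; the function names `induced_distance` and `is_ultrametric` are hypothetical. Each pair of samples receives the dissimilarity level of the merge that first places them in the same cluster.

```python
from itertools import combinations

def induced_distance(delta):
    """d[i][j] = level of the merge (under delta_min) that first joins samples i and j."""
    n = len(delta)
    label = list(range(n))                    # cluster label of every sample
    d = [[0] * n for _ in range(n)]
    for i, j in sorted(combinations(range(n), 2), key=lambda p: delta[p[0]][p[1]]):
        if label[i] == label[j]:
            continue                          # already in the same cluster
        a = [k for k in range(n) if label[k] == label[i]]
        b = [k for k in range(n) if label[k] == label[j]]
        for u in a:
            for v in b:
                d[u][v] = d[v][u] = delta[i][j]
        for v in b:
            label[v] = label[i]               # merge the two clusters
    return d

def is_ultrametric(d):
    """Check d(x, x'') <= max[d(x, x'), d(x', x'')] for every triple (Eq. 81)."""
    n = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k])
               for i in range(n) for j in range(n) for k in range(n))
```

For the table `[[0, 1, 4, 6], [1, 0, 2, 7], [4, 2, 0, 3], [6, 7, 3, 0]]`, which itself violates the triangle inequality (7 > 2 + 3), the induced distances satisfy the ultrametric inequality.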


10.10 *The Problem of Validity 


With almost all of the procedures considered thus far we have assumed that the num- 
ber of clusters is known. That is a reasonable assumption if we are upgrading a 
classifier that has been designed on a small labeled set, or if we are tracking slowly 
time-varying patterns. However, it may be an unjustified assumption if we are ex- 
ploring a data set whose properties are, at base, unknown. Thus, a recurring problem 
in cluster analysis is that of deciding just how many clusters are present. 

When clustering is done by extremizing a criterion function, a common approach 
is to repeat the clustering procedure for c = 1, c = 2, c = 3, etc., and to see how the 
criterion function changes with c. For example, it is clear that the sum-of-squared- 
error criterion Je must decrease monotonically with c, since the squared error can 
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be reduced each time c is increased merely by transferring a single sample to a new 
singleton cluster. If the n samples are really grouped into ĉ compact, well separated 
clusters, one would expect to see Je decrease rapidly until c = ĉ, decreasing much 
more slowly thereafter until it reaches zero at c = n. Similar arguments have been 
advanced for hierarchical clustering procedures and can be apparent in a dendrogram, 
the usual assumption being that a large disparity in the levels at which clusters merge 
indicates the presence of natural groupings. 

A more formal approach to this problem is to devise some measure of goodness of 
fit that expresses how well a given c-cluster description matches the data. The chi- 
squared and Kolmogorov-Smirnov statistics are the traditional measures of goodness 
of fit, but the curse of dimensionality usually demands the use of simpler measures, 
some criterion function, which we denote J(c). Since we expect a description in terms 
of c+1 clusters to give a better fit than a description in terms of c clusters, we would 
like to know what constitutes a statistically significant improvement in J(c). 

A formal way to proceed is to advance the null hypothesis that there are exactly 
c clusters present, and to compute the sampling distribution for J(c + 1) under this 
hypothesis. This distribution tells us what kind of apparent improvement to expect 
when a c-cluster description is actually correct. The decision procedure would be 
to accept the null hypothesis if the observed value of J(c + 1) falls within limits 
corresponding to an acceptable probability of false rejection. 

Unfortunately, it is usually very difficult to do anything more than crudely esti- 
mate the sampling distribution of J(c + 1). The resulting solutions are not above 
suspicion, and the statistical problem of testing cluster validity is still essentially un- 
solved. However, under the assumption that a suspicious test is better than none, 
we include the following approximate analysis for the simple sum-of-squared-error 
criterion which closely parallels our discussion in Chap. ??. 

Suppose that we have a set D of n samples and we want to decide whether or not 
there is any justification for assuming that they form more than one cluster. Let us 
advance the null hypothesis that all n samples come from a normal population with 
mean μ and covariance matrix σ²I.* If this hypothesis were true, multiple clusters 
found would have to have been formed by chance, and any observed decrease in the 
sum-of-squared error obtained by clustering would have no significance. 

The sum of squared error Je(1) is a random variable, since it depends on the 
particular set of samples: 

J_e(1) = \sum_{x \in D} \| x - m \|^2,     (82)

where m is the sample mean of the full data set. Under the null hypothesis, the 
distribution for Je(1) is approximately normal with mean ndσ² and variance 2ndσ⁴ 
(Problem 38). Suppose now that we partition the set of samples into two subsets D_1 
and D_2 so as to minimize Je(2), where 

J_e(2) = \sum_{i=1}^{2} \sum_{x \in D_i} \| x - m_i \|^2,     (83)

m_i being the mean of the samples in D_i. Under the null hypothesis, this partitioning 
is spurious, but it nevertheless results in a value for Je(2) that is smaller than Je(1). 


* We could of course assume a different cluster form, but in the absence of further information, the 
Gaussian can be justified on the grounds we have discussed before. 

If we knew the sampling distribution for Je(2), we could determine how small Je(2) 
would have to be before we were forced to abandon a one-cluster null hypothesis. 
Lacking an analytical solution for the optimal partitioning, we cannot derive an exact 
solution for the sampling distribution. However, we can obtain a rough estimate by 
considering the suboptimal partition provided by a hyperplane through the sample 
mean. For large n, it can be shown that the sum of squared error for this partition is 
approximately normal with mean n(d − 2/π)σ² and variance 2n(d − 8/π²)σ⁴. 

This result agrees with our statement that Je(2) is smaller than Je(1), since the 
mean of Je(2) for the suboptimal partition, n(d − 2/π)σ², is less than the mean 
for Je(1), ndσ². To be considered significant, the reduction in the sum-of-squared 
error must certainly be greater than this. We can obtain an approximate critical value 
for Je(2) by assuming that the suboptimal partition is nearly optimal, by using the 
normal approximation for the sampling distribution, and by estimating σ² according 
to 

\hat{\sigma}^2 = \frac{1}{nd} \sum_{x \in D} \| x - m \|^2 = \frac{1}{nd} J_e(1).     (84)

The final result can be stated as follows (Problem 39): Reject the null hypothesis at 
the p-percent significance level if 

\frac{J_e(2)}{J_e(1)} < 1 - \frac{2}{\pi d} - \alpha \sqrt{\frac{2 \, (1 - 8/(\pi^2 d))}{nd}},     (85)

where α is determined by 

p = 100 \int_{\alpha}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du = 100 \, (1 - erf(\alpha)),     (86)

and erf(·) here denotes the cumulative distribution function of the standard normal 
density (in terms of the complementary error function as it is more commonly defined, 
p = 50 erfc(α/√2)). This provides us with a test for deciding 
whether or not the splitting of a cluster is justified. Clearly the c-cluster problem can 
be treated by applying the same test to all clusters found. 
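
The test of Eqs. 85 and 86 is easy to evaluate numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the values of Je(1) and Je(2) are supplied by the caller, and the integral in Eq. 86 is taken to be the upper-tail area of the standard normal density, computed here with Python's `math.erfc`.

```python
import math

def critical_ratio(n, d, alpha):
    """Right-hand side of Eq. 85 for n samples in d dimensions."""
    return (1 - 2 / (math.pi * d)
            - alpha * math.sqrt(2 * (1 - 8 / (math.pi ** 2 * d)) / (n * d)))

def split_is_significant(je1, je2, n, d, alpha):
    """Reject the one-cluster null hypothesis when Je(2)/Je(1) is below the critical ratio."""
    return je2 / je1 < critical_ratio(n, d, alpha)

def significance_percent(alpha):
    """Eq. 86: p = 100 times the standard normal tail area beyond alpha."""
    return 100 * 0.5 * math.erfc(alpha / math.sqrt(2))
```

For n = 100 samples in d = 2 dimensions and α = 1.645 (p close to 5 percent), the critical ratio is roughly 0.555, so a two-cluster partition would need to reduce the sum of squared error to a little over half its one-cluster value before the split is accepted.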


10.11 Competitive Learning 


A clustering algorithm related to decision-directed versions of k-means (Algorithm 1) 
is based on neural network learning rules (Chap. ??) and called competitive learning. 
In both procedures, the number of desired clusters and their centers are initialized, 
and during clustering each pattern is provisionally classified into one of the clusters. 
The methods of updating the cluster centers differ, however. In the decision-directed 
method, each cluster center is calculated as the mean of the current provisional mem- 
bers. In competitive learning, the adjustment is confined to the single cluster center 
most similar to the pattern presented. As a result, in competitive learning clus- 
ters that are “far away” from the current pattern tend not to be altered (but see 
Sect. 10.11.2) — sometimes considered a desirable property. The drawback is that 
the solution need not minimize a single global cost or criterion function. 

We now turn to the specific competitive learning algorithm. For reasons that will 
become clear, each d-dimensional pattern is augmented (with x_0 = 1) and normalized 
to have length ||x|| = 1; thus all patterns lie on the surface of a d-dimensional sphere. 
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Figure 10.14: The two-layer network which implements the competitive learning al- 
gorithm consists of d + 1 input units and c output or cluster units. Each augmented 
input pattern is normalized to unit length, i.e., ||x|| = 1, as is the set of weights at 
each cluster unit. When a pattern is presented, each of the cluster units computes 
its net activation net_j = w_j^t x; only the weights at the most active cluster unit are 
modified. (The suppression of activity in all but the most active cluster units can be 
implemented by competition among these units, as indicated by the red arrows.) The 
weights of the most active unit are then modified to be more similar to the pattern 
presented. 


The competitive learning algorithm can be understood by its neural network imple- 
mentation (Fig. 10.14), which resembles a Perceptron network (Chap. ??, Fig. ??), 
with input units fully connected to c output or cluster units. 

Each of the c cluster centers is initialized with a randomly chosen weight vector, 
also normalized ||w_j|| = 1, j = 1, ..., c. It is traditional but not required to initialize 
cluster centers to be c points randomly selected from the data. When a new pattern 
is presented, each of the cluster units computes its net activation, net_j = w_j^t x. Only 
the most active neuron (i.e., the closest to the new pattern) is permitted to update 
its weights. While this selection of the most active unit is algorithmically trivial, it 
can be implemented in a winner-take-all network, where each cluster unit j inhibits 
others by an amount proportional to net_j, as shown by the red arrows in Fig. 10.14. 
It is this competition between cluster units, and the resulting suppression of activity 
in all but the one with the largest net, that gives the algorithm its name. 

Learning is confined to the weights at the most active unit. The weight vector at 
this unit is updated to be more like the pattern: 


w(t + 1) = w(t) + \eta x,     (87)

where η is a learning rate. The weights are then normalized to ensure \sum_{i=0}^{d} w_i^2 = 1. 

This normalization is needed to keep the classification and clustering based on the 
position in feature space rather than overall magnitude of w. Without such weight 
normalization, a single weight vector, say w_j, could grow in magnitude and forever give 
the greatest value net_j, and through competition thereby prevent other clusters from 
learning. Figure 10.15 shows the trajectories of three cluster centers in response to a 
sequence of patterns chosen randomly from the set shown. 
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Algorithm 6 (Competitive learning) 


1 begin initialize η, n, c, w_1, w_2, ..., w_c
2    x_i ← {1, x_i}, i = 1, ..., n     (augment all patterns)
3    x_i ← x_i/||x_i||, i = 1, ..., n     (normalize all patterns)
4    do randomly select a pattern x
5       j ← arg max_{j′} w_{j′}^t x     (classify x)
6       w_j ← w_j + ηx     (weight update)
7       w_j ← w_j/||w_j||     (weight normalization)
8    until no significant change in w in n attempts
9 return w_1, w_2, ..., w_c
10 end

Figure 10.15: All of the three-dimensional patterns have been normalized (\sum_{i=1}^{3} x_i^2 = 1), 
and hence lie on a two-dimensional sphere. Likewise, the weights of the three cluster 
centers have been normalized. The red curves show the trajectory of the weight 
vectors; at the end of learning, each lies near the center of a cluster. 


A drawback of Algorithm 6 is that there is no guarantee that it will terminate, 
even for a finite, non-pathological data set — the condition in line 8 may never be 
satisfied and thus the weights may vary forever. A simple heuristic is to decay the 
learning rate in line 6, for instance by η(t) = η(0)α^t for α < 1, where t is an iteration 
number. If the initial cluster centers are representative of the full data set, and the 
rate of decay is set so that the full data set is presented at least several times before the 
learning rate is reduced to very small values, then good results can be expected. However, 
if a novel pattern is then added, it cannot be learned, since η is too small. Likewise, 
such a learning decay scheme is inappropriate if we seek to track gradual changes in 
the data. 
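
A minimal Python sketch of competitive learning with a decaying learning rate follows. It is an illustration rather than a definitive implementation: the function names, the default values of `eta0`, `alpha` and `steps`, and the fixed number of presentations (used in place of the convergence test in line 8 of Algorithm 6) are all assumptions made for the example.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def competitive_learning(patterns, c, eta0=0.1, alpha=0.99, steps=2000, seed=0):
    rng = random.Random(seed)
    # Augment every pattern with x_0 = 1, then normalize to unit length.
    xs = [normalize([1.0] + list(p)) for p in patterns]
    # Initialize the c weight vectors at randomly selected patterns.
    ws = [list(x) for x in rng.sample(xs, c)]
    for t in range(steps):
        x = rng.choice(xs)
        eta = eta0 * alpha ** t               # eta(t) = eta(0) * alpha^t
        # The winner is the unit with the largest net activation w_j^t x.
        j = max(range(c), key=lambda k: sum(w * a for w, a in zip(ws[k], x)))
        # Update only the winner, then renormalize its weights.
        ws[j] = normalize([w + eta * a for w, a in zip(ws[j], x)])
    return ws
```

With `alpha` close to 1 the whole data set is presented many times before the learning rate becomes negligible; after that point the weights are effectively frozen, which is exactly why a pattern added later cannot be learned.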


In a non-stationary environment, we may want a clustering algorithm to be 
stable to prevent ceaseless recoding, and yet plastic, or changeable, in response to a 
new pattern. (Freezing cluster centers would prevent recoding, but would not per- 
mit learning of new patterns.) This tradeoff has been called the stability-plasticity 
dilemma, and we shall see in Sect. 10.11.2 how it can be addressed. First, however, 
we turn to the problem of unknown number of clusters. 
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10.11.1 Unknown number of clusters 


We have mentioned the problem of unknown number of cluster centers. When the 
number is unknown, we can proceed in one of two general ways. In the first, we 
compare some cluster criterion as a function of the number of clusters. If there is a 
large gap in the criterion values, it suggests a “natural” number of clusters. A second 
approach is to state a threshold for the creation of a new cluster. This is useful in 
on-line cases. The drawback is that it depends more strongly on the order of data 
presentation. 

Whereas clustering algorithms such as k-means and hierarchical clustering typi- 
cally have all data present before clustering begins (i.e., are off-line), there are occa- 
sionally situations in which clustering must be performed on-line as the data streams 
in, for instance when there is inadequate memory to store all the patterns themselves, 
or in a time-critical situation where the clusters need to be used even before the full 
data is present. Our graph theoretic methods can be performed on-line — one merely 
links the new pattern to an existing cluster based on some similarity measure. 

In order to make on-line versions of methods such as k-means, we will have to be a 
bit more careful. Under these conditions, the best approach generally is to represent 
clusters by their “centers” (e.g., means) and update each center based solely on its 
current value and the incoming pattern. Here we shall assume that the number of 
clusters is known, and return in Sect. ?? to the case where it is not known. 

Suppose we currently have c cluster centers; they may have been placed initially 
at random positions, or as the first c patterns presented, or the current state after any 
number of patterns have been presented. The simplest approach is to alter only the 
cluster center most similar to a new pattern being presented, and the cluster center 
is changed to be somewhat more like the pattern (Fig. 10.16). 


Figure 10.16: In leader-follower clustering, the number of clusters and their centers 
depend upon the random sequence of data presentations. The three simulations shown 
employed the same learning rate η, threshold θ, and number of presentations of each 
point (50), but differ in the random sequence of presentations. Notice that in the 
simulation on the left, three clusters are generated whereas in the other simulations, 
only two. 


If we let w_i represent the current center for cluster i, η a learning rate, and introduce 
a threshold θ, a relative of the basic leader-follower clustering algorithm is then: 


Algorithm 7 (Basic leader-follower clustering) 

1 begin initialize η, θ ← threshold
2    w_1 ← x
3    do accept new x
4       j ← arg min_{j′} ||x − w_{j′}||     (find nearest cluster)
5       if ||x − w_j|| < θ
6          then w_j ← w_j + ηx
7          else add new w ← x
8       w ← w/||w||     (normalize weight)
9    until no more patterns
10 return w_1, w_2, ...
11 end

Figure 10.17: Instability can arise when a pattern is assigned different cluster mem- 
berships at different times. Early in clustering the pattern marked x* lies in the black 
cluster, while later in clustering it lies in the red cluster. Similar pattern presentations 
can make x* alternate arbitrarily between clusters. 
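
The leader-follower procedure can be sketched in Python as follows. The names `leader_follower` and `normalize` are hypothetical, and the sketch assumes that every incoming pattern is nonzero and is scaled to unit length before comparison, as the cluster centers are.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (assumes v is nonzero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def leader_follower(patterns, eta, theta):
    centers = []
    for p in patterns:
        x = normalize(p)
        if not centers:
            centers.append(x)                 # the first pattern founds the first cluster
            continue
        # Find the nearest existing cluster center.
        j = min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
        if math.dist(x, centers[j]) < theta:
            # Close enough: move the center toward x and renormalize.
            centers[j] = normalize([w + eta * a for w, a in zip(centers[j], x)])
        else:
            centers.append(x)                 # too far from every center: new cluster
    return centers
```

Because centers are created and moved as patterns arrive, the number of clusters returned depends on the threshold and on the order of presentation.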


Before we analyze some drawbacks of such a leader-follower clustering algorithm, 
let us consider one popular neural technique for achieving it. 


10.11.2 Adaptive Resonance 


The simplest adaptive resonance networks (or Adaptive Resonance Theory or ART 
networks) perform a modification of the On-line clustering with cluster creation pro- 
cedure we have just seen. While the primary motivation for ART was to explain 
biological learning, we shall not be concerned here with their biological relevance nor 
with their use in supervised learning (but see Problem 41). 

The above algorithm, however, can occasionally present a problem, regardless 
of whether it is implemented via competitive learning. Consider a cluster w_1 that 
originally codes a particular pattern x_0, i.e., if x_0 is presented, the output node having 
weights w_1 is most activated. Suppose a “hostile” sequence of patterns is presented, 
i.e., one that sweeps the cluster centers in unusual ways (Fig. 10.17). It is possible 
that after the cluster centers have been swept, x_0 is coded by w_2. Indeed, a 
particularly devious sequence can lead x_0 to be coded by an arbitrary sequence of 
cluster centers, with any cluster center being active an arbitrary number of times. 

The network works as follows. First a pattern is presented to the input units. This 
leads via bottom-up connections w_ij to activations in the output units. A winner- 
take-all computation leads to only the most activated output unit being active; all 
other output units are suppressed. Activation is then sent back to the input units 
via weights w_ji. This leads, in turn, to a modification of the activation of the input 
units. Very quickly, a stable configuration of output and input units occurs, called 
a “resonance” (though this has nothing to do with the type of resonance in a driven 
oscillator). 

Figure 10.18: Adaptive Resonance network (ART1 for binary patterns). Weights 
are bidirectional, gain, the orienting system controls the , and hence (indirectly) the 
number of clusters found. 

ART networks detect novelty by means of the orienting subsystem. The details 
need not concern us here, but in broad overview, the orienting subsystem has two 
inputs: the total number of active input features and the total number of features 
that are active in the input layer. (Note that these two numbers need not be the 
same, since the top-down feedback affects the activation of the input units, but not 
the number of active inputs themselves.) If an input pattern is “too different” from 
any current cluster centers, then the orienting subsystem sends a reset wave signal 
that renders the active output unit quiet. This allows a new cluster center to be 
found, or if all have been explored, then a new cluster center is created. 

The criterion for “too different” is a single number, set by the user, called the 
vigilance, ρ (0 ≤ ρ ≤ 1). Denoting the number of active input features as |I| and the 
number active in the input layer during a resonance as |R|, then there will be a reset 
if 

\frac{|R|}{|I|} < \rho,     (88)

where ρ is the vigilance parameter. A low vigilance pa- 
rameter means that there can be a poor “match” between the input and the learned 
cluster and the network will accept it. (Thus vigilance and the ratio of the number of 
features used by ART, while motivated by proportional considerations, is just one of 
an infinite number of possible closeness criteria, related to δ.) For the same data set, 
a low vigilance leads to a small number of large coarse clusters being formed, while a 
high vigilance leads to a large number of fine clusters (Fig. 10.19). 
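
The reset criterion of Eq. 88 can be illustrated for binary patterns with a very small Python sketch. It rests on an assumption not spelled out above: that the features remaining active during resonance are those present both in the input and in the winning cluster's top-down template, so that |R| is the size of their intersection. The function name `passes_vigilance` is hypothetical.

```python
def passes_vigilance(input_bits, template_bits, rho):
    """Return True when |R| / |I| >= rho, i.e. when no reset is triggered."""
    active_inputs = sum(input_bits)                                   # |I|
    resonant = sum(a & b for a, b in zip(input_bits, template_bits))  # |R|
    return resonant / active_inputs >= rho    # assumes at least one active input
```

With input `[1, 1, 1, 0]` and template `[1, 1, 0, 0]` the ratio is 2/3, so the match is accepted at ρ = 0.5 but causes a reset at ρ = 0.8, consistent with high vigilance producing more and finer clusters.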

We have presented the basic approach and issues with ART1, but these return 
(though in a more subtle way) in analog versions of ART in the literature. 
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Figure 10.19: The results of ART1 applied to a sequence of binary figures. a) p = xa. 
b) p=0.xxz. 


10.12 *Graph Theoretic Methods 


Where the mathematics of normal mixtures and minimum-variance partitions leads 
us to picture clusters as isolated clumps, the language and concepts of graph theory 
lead us to consider much more intricate structures. Unfortunately, there is no uniform 
way of posing clustering problems as problems in graph theory. Thus, the effective 
use of these ideas is still largely an art, and the reader who wants to explore the 
possibilities should be prepared to be creative. 

We begin our brief look into graph-theoretic methods by reconsidering the simple 
procedures that produce the graphs shown in Fig. 10.6. Here a threshold distance d_0 
was selected, and two points are placed in the same cluster if the distance between 
them is less than d_0. This procedure can easily be generalized to apply to arbitrary 
similarity measures. Suppose that we pick a threshold value s_0 and say that x_i is 
similar to x_j if s(x_i, x_j) > s_0. This defines an n-by-n similarity matrix S = [s_ij], with 
binary component 

s_{ij} = \begin{cases} 1 & \text{if } s(x_i, x_j) > s_0 \\ 0 & \text{otherwise.} \end{cases}     (89)

Furthermore, this matrix induces a similarity graph, dual to S, in which nodes corre- 
spond to points and an edge joins node i and node j if and only if s_ij = 1. 

The clusterings produced by the single-linkage algorithm and by a modified version 
of the complete-linkage algorithm are readily described in terms of this graph. With 
the single-linkage algorithm, two samples x and x’ are in the same cluster if and only 
if there exists a chain x, x_1, x_2, ..., x_k, x′ such that x is similar to x_1, x_1 is similar to 
x_2, and so on for the whole chain. Thus, this clustering corresponds to the connected 
components of the similarity graph. With the complete-linkage algorithm, all samples 
in a given cluster must be similar to one another, and no sample can be in more than 
one cluster. If we drop this second requirement, then this clustering corresponds to 
the maximal complete subgraphs of the similarity graph — the “largest” subgraphs 
with edges joining all pairs of nodes. (In general, the clusters of the complete-linkage 
algorithm will be found among the maximal complete subgraphs, but they cannot be 
determined without knowing the unquantized similarity values.) 
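
The single-linkage case can be sketched directly from the thresholded similarity matrix: build the binary matrix of Eq. 89 and collect the connected components of the resulting graph. The function name `similarity_clusters` and the similarity function in the usage note are illustrative assumptions.

```python
def similarity_clusters(points, s, s0):
    """Connected components of the similarity graph defined by threshold s0."""
    n = len(points)
    # Binary similarity matrix: S[i][j] = 1 if s(x_i, x_j) > s0, else 0.
    S = [[1 if s(points[i], points[j]) > s0 else 0 for j in range(n)]
         for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:                          # depth-first search along edges
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if S[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters
```

For the one-dimensional points 0, 1, 2, 10, 11 with similarity s(a, b) = -|a - b| and threshold -1.5, points 0 and 2 are not directly similar but are chained through 1, so they fall in the same component.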
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In the preceding section we noted that the nearest-neighbor algorithm could be 
viewed as an algorithm for finding a minimal spanning tree. Conversely, given a 
minimal spanning tree we can find the clusterings produced by the nearest-neighbor 
algorithm. Removal of the longest edge produces the two-cluster grouping, removal of 
the next longest edge produces the three-cluster grouping, and so on. This amounts 
to a divisive hierarchical procedure, and suggests other ways of dividing the graph 
into subgraphs. For example, in selecting an edge to remove, we can compare its 
length to the lengths of other edges incident upon its nodes. Let us say that an edge 
is inconsistent if its length l is significantly larger than l̄, the average length of all 
other edges incident on its nodes. Figure 10.20 shows a minimal spanning tree for a 
two-dimensional point set and the clusters obtained by systematically removing all 
edges for which l > 2l̄ in this way. Because this criterion is sensitive to local conditions, it gives 
results that are quite different from merely removing the two longest edges. 
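
The inconsistent-edge rule can be sketched in Python as follows. The sketch assumes Euclidean distances, builds the minimal spanning tree with a simple (and slow) version of Prim's algorithm, and treats an edge with no neighboring edges as consistent; the function names and the default `factor=2.0` are choices made for this example.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        edges.append((i, j, math.dist(points[i], points[j])))
        in_tree.add(j)
    return edges

def remove_inconsistent(points, factor=2.0):
    """Drop tree edges longer than factor times the mean of their neighboring edges."""
    edges = mst_edges(points)
    label = list(range(len(points)))
    for i, j, length in edges:
        # Lengths of the other tree edges incident on node i or node j.
        others = [l for a, b, l in edges if (a, b) != (i, j) and {a, b} & {i, j}]
        average = sum(others) / len(others) if others else length
        if length <= factor * average:        # consistent edge: keep it, join its ends
            old, new = label[j], label[i]
            label = [new if l == old else l for l in label]
    groups = {}
    for index, l in enumerate(label):
        groups.setdefault(l, []).append(index)
    return sorted(groups.values())
```

For two runs of unit-spaced collinear points separated by a gap of 7, only the bridging edge is more than twice as long as its neighbors, so its removal leaves the two runs as separate clusters.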




Figure 10.20: The removal of inconsistent edges — ones with length significantly larger 
than the average incident upon a node — may yield natural clusters. The original 
data is shown at the left and its minimal spanning tree is shown in the middle. At 
virtually every node, incident edges are of nearly the same length. Each of the two 
nodes shown in red are exceptions: their incident edges are of very different lengths. 
When the two such inconsistent edges are removed, three clusters are produced, as 
shown at the right. 


When the data points are strung out into long chains, a minimal spanning tree 
forms a natural skeleton for the chain. If we define the diameter path as the longest 
path through the tree, then a chain will be characterized by the shallow depth of 
the branching off the diameter path. In contrast, for a large, uniform cloud of data 
points, the tree will usually not have an obvious diameter path, but rather several 
distinct, near-diameter paths. For any of these, an appreciable number of nodes will 
be off the path. While slight changes in the locations of the data points can cause 
major rerouting of a minimal spanning tree, they typically have little effect on such 
statistics. 


One of the useful statistics that can be obtained from a minimal spanning tree 
is the edge length distribution. Figure 10.21 shows a situation in which two dense 
clusters are embedded in a sparse set of points; the lengths of the edges of the min- 
imal spanning tree exhibit two distinct clusters which would easily be detected by a 
minimum-variance procedure. By deleting all edges longer than some intermediate 
value, we can extract the dense cluster as the largest connected component of the 
remaining graph. While more complicated configurations can not be disposed of this 
easily, the flexibility of the graph-theoretic approach suggests that it is applicable to 
a wide variety of clustering problems. 




Figure 10.21: A minimal spanning tree is shown at the left; its bimodal edge length 
distribution is evident in the histogram below (number of edges versus edge length). 
If all links of intermediate or high length are removed (red), the two natural clusters 
are revealed (right). 


10.13 Component analysis 


Component analysis is an unsupervised approach to finding the “right” features from 
the data. We shall discuss two leading methods, each having a somewhat different 
goal. In principal component analysis (PCA), we seek to represent the d-dimensional 
data in a lower-dimensional space. This reduces the degrees of freedom and 
the space and time complexities. The goal is to represent data in a space that best 
describes the variation in a sum-squared error sense, as we shall see. In independent 
component analysis (ICA) we seek those directions along which the signals are most 
independent. This method is particularly helpful for segmenting signals from multiple 
sources. As with standard clustering methods, it helps greatly if we know how many 
independent components exist ahead of time. 


10.13.1 Principal component analysis (PCA) 


The basic approach in principal components, or the Karhunen-Loève transform, is 
conceptually quite simple. First, the d-dimensional mean vector $\boldsymbol{\mu}$ and d × d covariance 
matrix $\boldsymbol{\Sigma}$ are computed for the full data set. Next, the eigenvectors and eigenvalues 
are computed (cf. Appendix ??), and sorted according to decreasing eigenvalue. Call 
these eigenvectors $\mathbf{e}_1$ with eigenvalue $\lambda_1$, $\mathbf{e}_2$ with eigenvalue $\lambda_2$, and so on. Next, the 
k eigenvectors having the largest eigenvalues are chosen. In practice, this is done by 
looking at the spectrum of eigenvalues. Often there will be xxx implying an inherent dimensionality of 
the subspace governing the “signal.” The other dimensions are noise. Form a d × k 
matrix $\mathbf{A}$ whose columns consist of the k eigenvectors. 
Preprocess data according to: 


$$\mathbf{x}' = \mathbf{A}^t(\mathbf{x} - \boldsymbol{\mu}). \tag{90}$$
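
The steps above (mean, covariance, sorted eigenvectors, projection) can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the covariance is normalized by n, the eigenvectors are found by power iteration with deflation (chosen here only to keep the example self-contained), and the four data points and all function names are made up for the example.

```python
import math

def mean_and_covariance(data):
    """Sample mean and covariance (normalized by n, an assumed convention)."""
    n, d = len(data), len(data[0])
    mu = [sum(x[i] for x in data) / n for i in range(d)]
    sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in data) / n
              for j in range(d)] for i in range(d)]
    return mu, sigma

def top_eigenpairs(sigma, k, iterations=500):
    """Power iteration with deflation: the k largest eigenvalues and their eigenvectors."""
    d = len(sigma)
    m = [row[:] for row in sigma]
    values, vectors = [], []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]      # arbitrary non-zero start
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        lam = sum(v[i] * m[i][j] * v[j] for i in range(d) for j in range(d))
        values.append(lam)
        vectors.append(v)
        # deflate so that the next pass finds the next eigenvector
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return values, vectors

def pca_project(x, mu, vectors):
    """x' = A^t (x - mu), where the columns of A are the chosen eigenvectors."""
    return [sum(e[i] * (x[i] - mu[i]) for i in range(len(x))) for e in vectors]

# Hypothetical two-dimensional data with most variance along the first axis.
data = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
mu, sigma = mean_and_covariance(data)
values, vectors = top_eigenpairs(sigma, k=2)
projected = pca_project((2.0, 0.0), mu, vectors[:1])   # keep k = 1 component
```

For real data sets a library eigen-solver would normally replace the power iteration; the projection step is unchanged.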


It can be shown that this representation minimizes a squared error criterion (Prob- 
lem 42). 


10.13.2 Non-linear component analysis 


We have just seen how to find a k-dimensional linear subspace of feature space that 
best represents the full data in a minimum-squared-error sense. If the data 
set is not well described by a sample mean and covariance matrix, but instead 
involves complicated interactions of features, then the linear subspace may be a poor 
representation. In such a case non-linear components may be needed. 

A neural network approach to such non-linear component analysis employs a net- 
work with five layers of units, as shown in Fig. 10.22. The middle layer consists of 
k < d linear units, and it is here that the non-linear components will be revealed. 
It is important that the two other internal layers have nonlinear units (Problem 44). 
The entire network is trained using the techniques of Chapt. ?? as an auto-encoder 
or auto-associator. That is, each d-dimensional pattern is presented as input and as 
the target or desired output. When trained on a sum-squared error criterion, such a 
network readily learns the auto-encoder problem. 

The top two layers of the trained network are discarded, and the rest is used to 
compute the non-linear components. For each input pattern x, the outputs of the k units of the 
remaining three-layer network correspond to the non-linear components. 


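[Figure 10.22 diagram: a five-layer network drawn from the input layer (bottom) through two nonlinear layers to the output layer (top), with the mappings F1 and F2 shown in feature space at the right.] 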


Figure 10.22: A five-layer neural network with two layers of non-linear units (e.g., 
sigmoidal), trained to be an auto-encoder, develops an internal representation that 
corresponds to the non-linear principal components of the full data set. (Bias units 
are not shown.) The process can be viewed in feature space (at the right). The 
transformation $F_1$ is a non-linear projection onto a k-dimensional non-linear subspace 
denoted $\Gamma(F_2)$. Points in $\Gamma(F_2)$ are mapped via $F_2$ back to the d-dimensional 
space of the original data. 


We can understand the function of the full five-layer network in terms of two 
successive mappings: $F_1$, a projection from the d-dimensional input onto a k-dimensional 
nonlinear subspace, followed by $F_2$, a mapping from that subspace back to the full 
d-dimensional space, as shown in the right of the figure. 

Learning in the original network is highly nonlinear, and during training care 
must be taken so as to avoid a poor local minimum (Chap. ??). Naturally, one 
must take care to set an appropriate number k of units. Recall that in (linear) 
principal component analysis, the number of components k could be chosen based 
on the spectrum of eigenvalues. If the eigenvalues are ordered by magnitude, any 
significant drop between successive values indicates a “natural” dimension 
of the subspace. Likewise, suppose five-layer networks are trained, with different 
numbers k of units in the middle layer. Assuming poor local minima have been 
avoided, the training error will surely decrease for successively larger values of k. If 
the improvement in going from k to k + 1 is small, this may indicate that k is the “natural” 
dimension of the nonlinear subspace. 
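
For the linear case, the rule of looking for a significant drop in the ordered eigenvalues can be written down directly. The sketch below is only one possible reading of "significant": the ratio threshold `drop_ratio` and the example spectrum are assumptions chosen for illustration, not values from the text.

```python
def natural_dimension(eigenvalues, drop_ratio=10.0):
    """Return the first k at which lambda_k / lambda_(k+1) exceeds drop_ratio.

    eigenvalues must be sorted in decreasing order. drop_ratio is an assumed
    threshold. If no such drop exists, all dimensions are kept.
    """
    for k in range(1, len(eigenvalues)):
        if eigenvalues[k] <= 0 or eigenvalues[k - 1] / eigenvalues[k] > drop_ratio:
            return k
    return len(eigenvalues)

# Hypothetical spectrum: three large eigenvalues followed by small "noise" values.
spectrum = [9.1, 7.4, 5.2, 0.11, 0.09, 0.05]
k = natural_dimension(spectrum)
```

The same function could be applied to training errors of five-layer networks with different k, provided the errors are first converted into successive improvements.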

We should not conclude that principal component analysis is always beneficial 
for classification. If the noise is large compared to the difference between categories, 
then component analysis will find the directions of the noise, rather than the signal, 
as illustrated in Fig. 10.23. In such cases, we seek to ignore the noise, and instead 
extract the directions that are indicative of the categories — a technique we consider 
next. 




Figure 10.23: Features from two classes are as shown, along with nonlinear 
components of the full data set. Apparently, these classes are well separated along the $y_2$ 
direction, but because of the large noise the largest nonlinear component lies along $y_1$. 
Preprocessing by keeping merely the largest nonlinear component would retain the “noise” 
and discard the “signal,” giving poor recognition. The same defect arises in linear 
principal components, where the components are linear and everywhere 
perpendicular. 


10.13.3 *Independent component analysis (ICA) 


Suppose there are c independent scalar source signals $x_i(t)$ for i = 1, ..., c, where we 
can consider t to be a time index, 1 ≤ t ≤ T. For notational convenience we group the 
c values at an instant into a vector $\mathbf{x}(t)$ and assume, further, that the vector has zero 
mean. Because of our independence assumption, and an assumption of no noise, 
the multivariate density function can be written as 


$$p(\mathbf{x}(t)) = \prod_{i=1}^{c} p(x_i(t)). \tag{91}$$

Suppose that a d-dimensional data (or sensor) vector is observed at each moment, 


y(t) = Ax(t), (92) 


where $\mathbf{A}$ is a d × c scalar matrix, and below we shall require d ≥ c. 

The method is perhaps best illustrated in its most typical use. Suppose there are 
c sound sources being sensed by d microphones, all in a room. Each microphone gets 
a mixture of the sources, with amplitudes depending upon the distances (Fig. 10.24). 
(We shall ignore any effects of delays.) 


Figure 10.24: Independent component analysis (ICA) is an unsupervised method that 
can be applied to the problem of blind source separation. In such problems, two or 
more source signals (assumed independent) $x_1(t), x_2(t), \ldots, x_d(t)$ are combined to 
yield a sum signal, $s_1(t) + s_2(t) + \cdots + s_c(t)$ where c ≥ d. (This figure illustrates a case 
with only two components.) Given merely the linear signals, and the assumption of 
the number of components, d, the task of ICA is to recover the source signals. This 
is equivalent to finding a matrix $\mathbf{W}$ that is the inverse of $\mathbf{A}$. In general applications 
of ICA, one seeks to extract independent components from the sensed signals, 
whether or not they arose from a linear mixture of initial sources. 


The task of independent component analysis is to recover the source signals from 
the sensed signals. More specifically, we seek a real matrix W such that 


z(t) = Wy(t) = WAx(t), (93) 


where $\mathbf{z}$ is an estimate of the sources $\mathbf{x}(t)$. Of course we seek $\mathbf{W} = \mathbf{A}^{-1}$, but neither 
$\mathbf{A}$ nor its inverse is known. 

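We approach the determination of $\mathbf{A}$ by maximum-likelihood techniques. We use 
an estimate of the density, parameterized by $\mathbf{a}$ and written $\hat{p}(\mathbf{y}; \mathbf{a})$, and seek the parameter vector 
$\mathbf{a}$ that minimizes the difference between the source distribution and the estimate. 
That is, $\mathbf{a}$ comprises the basis vectors of $\mathbf{A}$, and thus $\hat{p}(\mathbf{y}; \mathbf{a})$ is an estimate of $p(\mathbf{y})$. 

This difference can be quantified by the Kullback-Leibler divergence: 


$$\begin{aligned}
D(p(\mathbf{y}), \hat{p}(\mathbf{y}; \mathbf{a})) &= D(p(\mathbf{y}) \,\|\, \hat{p}(\mathbf{y}; \mathbf{a})) \\
&= \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\hat{p}(\mathbf{y}; \mathbf{a})} \, d\mathbf{y} \\
&= -H(\mathbf{y}) - \int p(\mathbf{y}) \log \hat{p}(\mathbf{y}; \mathbf{a}) \, d\mathbf{y},
\end{aligned} \tag{94}$$


where $H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y}$ is the entropy. For n observed samples $\mathbf{y}_1, \ldots, \mathbf{y}_n$, the log-likelihood is 


$$l(\mathbf{a}) = \sum_{i=1}^{n} \log \hat{p}(\mathbf{y}_i; \mathbf{a}), \tag{95}$$


and using the law of large numbers, the normalized log-likelihood can be written in terms of the 
Kullback-Leibler divergence as 


$$\begin{aligned}
\frac{1}{n} l(\mathbf{a}) = \frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(\mathbf{y}_i; \mathbf{a})
&\approx \int p(\mathbf{y}) \log \hat{p}(\mathbf{y}; \mathbf{a}) \, d\mathbf{y} \\
&= \int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y} - \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\hat{p}(\mathbf{y}; \mathbf{a})} \, d\mathbf{y} \\
&= \underbrace{-H(\mathbf{y})}_{\text{indep. of } \mathbf{W}} - D(p(\mathbf{y}) \,\|\, \hat{p}(\mathbf{y}; \mathbf{a})),
\end{aligned} \tag{96}$$


where the entropy $H(\mathbf{y})$ is independent of $\mathbf{W}$. Thus we maximize the log-likelihood 
by minimizing the Kullback-Leibler divergence with respect to the estimated density 
$\hat{p}(\mathbf{y}; \mathbf{a})$: 


$$\frac{1}{n} \frac{\partial l(\mathbf{a})}{\partial \mathbf{W}} \approx -\frac{\partial}{\partial \mathbf{W}} D(p(\mathbf{y}) \,\|\, \hat{p}(\mathbf{y}; \mathbf{a})). \tag{97}$$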


Because $\mathbf{A}$ is an invertible matrix, and because the Kullback-Leibler divergence is 
invariant under invertible transformation (Problem 47), we have 


$$\frac{1}{n} \frac{\partial l(\mathbf{a})}{\partial \mathbf{W}} \approx -\frac{\partial}{\partial \mathbf{W}} D(p(\mathbf{x}) \,\|\, \hat{p}(\mathbf{z})). \tag{98}$$

$$\frac{\partial H(\mathbf{y})}{\partial \mathbf{W}} = \frac{\partial}{\partial \mathbf{W}} \log |\mathbf{W}| + \frac{\partial}{\partial \mathbf{W}} \sum_i \log \left| \frac{\partial u_i}{\partial y_i} \right| = [\mathbf{W}^t]^{-1} - \boldsymbol{\varphi}(\mathbf{z}) \mathbf{y}^t, \tag{99}$$


where $\boldsymbol{\varphi}(\mathbf{z})$ is the score function, the gradient vector of the log likelihood: 


$$\boldsymbol{\varphi}(\mathbf{z}) = -\begin{pmatrix} \dfrac{\partial p(z_1)/\partial z_1}{p(z_1)} \\ \vdots \\ \dfrac{\partial p(z_q)/\partial z_q}{p(z_q)} \end{pmatrix}. \tag{100}$$

Thus the learning rule is 

$$\Delta \mathbf{W} \propto [\mathbf{W}^t]^{-1} - \boldsymbol{\varphi}(\mathbf{z}) \mathbf{y}^t. \tag{101}$$

A simpler form comes if we merely scale, following the natural gradient: 


$$\Delta \mathbf{W} \propto \frac{\partial H(\mathbf{y})}{\partial \mathbf{W}} \mathbf{W}^t \mathbf{W} = [\mathbf{I} - \boldsymbol{\varphi}(\mathbf{z}) \mathbf{z}^t] \mathbf{W}. \tag{102}$$

This, then, is the learning algorithm. 

An assumption is that at most one of the sources is Gaussian distributed (Prob- 
lem 46). Indeed this method is most successful if the distributions are highly skewed 
or otherwise deviate markedly from Gaussian. 

We can understand the difference between PCA and ICA in the following way. 
Imagine that there were two sources that are correlated, giving large correlated signals 
in a particular direction. PCA would find that direction, and indeed would reduce the 
sum-squared error. Such components are not independent, and would not be useful 
for separating the sources. As such, they would not be found by ICA. Instead, ICA 
would find those directions that are best for separating the sources, even if those 
directions have small eigenvalues. 
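
To make the natural-gradient learning rule concrete, the following sketch applies a batch-averaged update of the form [I − φ(z)zᵗ]W to two synthetic mixtures. Everything specific here is an assumption made for illustration: the score function is taken to be φ(z) = tanh(z) (which corresponds to assuming a source density proportional to 1/cosh z), the sources are synthetic Laplacian signals, and the mixing matrix, step size `eta`, and iteration count are arbitrary choices. It is a sketch under these assumptions, not the procedure of any particular reference.

```python
import math
import random

def natural_gradient_ica(samples, eta=0.1, iterations=300):
    """Batch form of  delta W  proportional to  [I - phi(z) z^t] W  for two sensed signals."""
    W = [[1.0, 0.0], [0.0, 1.0]]
    n = len(samples)
    for _ in range(iterations):
        C = [[0.0, 0.0], [0.0, 0.0]]          # running average of phi(z) z^t
        for y in samples:
            z = [W[0][0] * y[0] + W[0][1] * y[1],
                 W[1][0] * y[0] + W[1][1] * y[1]]
            phi = [math.tanh(z[0]), math.tanh(z[1])]   # assumed score function
            for i in range(2):
                for j in range(2):
                    C[i][j] += phi[i] * z[j] / n
        G = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(2)] for i in range(2)]
        W = [[W[i][j] + eta * sum(G[i][k] * W[k][j] for k in range(2))
              for j in range(2)] for i in range(2)]
    return W

rng = random.Random(0)

def laplacian():
    """A zero-mean, markedly non-Gaussian source sample."""
    return rng.expovariate(1.0) * rng.choice((-1.0, 1.0))

A = [[1.0, 0.6], [0.5, 1.0]]                  # hypothetical mixing matrix
sources = [(laplacian(), laplacian()) for _ in range(1000)]
sensed = [(A[0][0] * x0 + A[0][1] * x1, A[1][0] * x0 + A[1][1] * x1)
          for x0, x1 in sources]

W = natural_gradient_ica(sensed)
# If separation succeeded, WA is close to a scaled permutation matrix.
P = [[sum(W[i][k] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
```

Each row of `P` should be dominated by a single entry; the sources are recovered only up to scale and ordering, which is inherent in the problem.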

Generally speaking, when used as preprocessing for classification, independent 
component analysis has several characteristics that make it more desirable than linear 
or non-linear principal component analysis. As we saw in Fig. 10.23, such principal 
components need not be effective in separating classes. Recall that the sensed input 
consists of a signal (due to the true categories) plus noise. If the noise is much 
larger than the signal, principal components will depend more upon the noise than on 
the signal. Since the different categories are, we assume, independent, independent 
component analysis is likely to extract those features that are useful in distinguishing 
the classes. 


10.14 Low-Dimensional Representations and Multidimensional Scaling (MDS) 


Part of the problem of deciding whether or not a given clustering means anything 
stems from our inability to visualize the structure of multidimensional data. This 
problem is further aggravated when similarity or dissimilarity measures are used that 
lack the familiar properties of distance. One way to attack this problem is to try to 
represent the data points as points in some lower-dimensional space in such a way 
that the distances between points in that space correspond to the dissimilarities 
between points in the original space. If acceptably accurate representations can be 
found in two or perhaps three dimensions, this can be an extremely valuable way to 
gain insight into the structure of the data. The general process of finding a configura- 
tion of points whose interpoint distances correspond to similarities or dissimilarities 
is often called multidimensional scaling. 

Let us begin with the simpler case where it is meaningful to talk about the 
distances between the n samples $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Let $\mathbf{y}_i$ be the lower-dimensional image of 
$\mathbf{x}_i$, $\delta_{ij}$ be the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$, and $d_{ij}$ be the distance between $\mathbf{y}_i$ and 
$\mathbf{y}_j$ (Fig. 10.25). Then we are looking for a configuration of image points $\mathbf{y}_1, \ldots, \mathbf{y}_n$ 
for which the n(n − 1)/2 distances $d_{ij}$ between image points are as close as 
possible to the corresponding original distances $\delta_{ij}$. Since it will usually not be possible 
to find a configuration for which $d_{ij} = \delta_{ij}$ for all i and j, we need some criterion 
for deciding whether or not one configuration is better than another. The following 
sum-of-squared-error functions are all reasonable candidates: 


$$J_{ee} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2} \tag{103}$$

$$J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2 \tag{104}$$

$$J_{ef} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}} \tag{105}$$


Since these criterion functions involve only the distances between points, they are 
invariant to rigid-body motions of the configurations. Moreover, they have all been 
normalized so that their minimum values are invariant to dilations of the sample 
points. While $J_{ee}$ emphasizes the largest errors (regardless of whether the distances $\delta_{ij}$ 
are large or small), $J_{ff}$ emphasizes the largest fractional errors (regardless of whether 
the errors $|d_{ij} - \delta_{ij}|$ are large or small). A useful compromise is $J_{ef}$, which emphasizes 
the largest product of error and fractional error. 


Figure 10.25: The distances between points in the original space are $\delta_{ij}$, while in the 
projected space they are $d_{ij}$. In practice, the source space is typically of very high dimension, 
and the mapped space of just two or three dimensions, to aid visualization. (In order 
to illustrate the correspondence between points in the two spaces, the size and color 
of each point $\mathbf{x}_i$ matches that of its image $\mathbf{y}_i$.) 


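Once a criterion function has been selected, an optimal configuration $\mathbf{y}_1, \ldots, \mathbf{y}_n$ 
is defined as one that minimizes that criterion function. An optimal configuration 
can be sought by a standard gradient-descent procedure, starting with some initial 
configuration and changing the $\mathbf{y}_i$’s in the direction of greatest rate of decrease in the 
criterion function. Since 


$$d_{ij} = \|\mathbf{y}_i - \mathbf{y}_j\|,$$


the gradient of $d_{ij}$ with respect to $\mathbf{y}_i$ is merely a unit vector in the direction of $\mathbf{y}_i - \mathbf{y}_j$. 
Thus, the gradients of the criterion functions are easy to compute: 


$$\nabla_{\mathbf{y}_k} J_{ee} = \frac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \neq k} (d_{kj} - \delta_{kj}) \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}$$

$$\nabla_{\mathbf{y}_k} J_{ff} = 2 \sum_{j \neq k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}^2} \, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}$$

$$\nabla_{\mathbf{y}_k} J_{ef} = \frac{2}{\sum_{i<j} \delta_{ij}} \sum_{j \neq k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}} \, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}$$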


The starting configuration can be chosen randomly, or in any convenient way that 
spreads the image points about. If the image points lie in a $\hat{d}$-dimensional space, 
then a simple and effective starting configuration can be found by selecting those $\hat{d}$ 
coordinates of the samples that have the largest variance. 

The following example illustrates the kind of results that can be obtained by these 
techniques. The data consist of thirty points spaced at unit intervals along a spiral 
in three-dimensions: 


$$\begin{aligned}
x_1(k) &= \cos(k/\sqrt{2}) \\
x_2(k) &= \sin(k/\sqrt{2}) \\
x_3(k) &= k/\sqrt{2}, \qquad k = 0, 1, \ldots, 29.
\end{aligned}$$

Figure 10.26 shows the three-dimensional data. When the $J_{ef}$ criterion was used, 
twenty iterations of a gradient descent procedure produced the two-dimensional 
configuration shown at the right. Of course, translations, rotations, and reflections of 
this configuration would be equally good solutions. 
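
The spiral example can be reproduced in outline with the sketch below, which evaluates the J_ef criterion (Eq. 105), uses its gradient, and starts from the two coordinates of largest variance. The function names, the initial step size, and the step-halving safeguard (a step is accepted only if it lowers the criterion) are assumptions added for this illustration; the text does not specify how the step size was chosen.

```python
import math

def pairwise(points):
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def j_ef(image, delta):
    """Criterion of Eq. 105 for image points and source distances delta."""
    n = len(image)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    norm = sum(delta[i][j] for i, j in pairs)
    return sum((math.dist(image[i], image[j]) - delta[i][j]) ** 2 / delta[i][j]
               for i, j in pairs) / norm

def gradient_j_ef(image, delta):
    """Gradient of J_ef with respect to every image point y_k."""
    n, dim = len(image), len(image[0])
    norm = sum(delta[i][j] for i in range(n) for j in range(i + 1, n))
    grad = [[0.0] * dim for _ in range(n)]
    for k in range(n):
        for j in range(n):
            if j == k:
                continue
            d = math.dist(image[k], image[j])
            coeff = 2.0 / norm * (d - delta[k][j]) / (delta[k][j] * d)
            for a in range(dim):
                grad[k][a] += coeff * (image[k][a] - image[j][a])
    return grad

def largest_variance_coordinates(source, dim):
    """Starting configuration: keep the dim coordinates of largest variance."""
    def variance(a):
        m = sum(p[a] for p in source) / len(source)
        return sum((p[a] - m) ** 2 for p in source) / len(source)
    keep = sorted(range(len(source[0])), key=variance, reverse=True)[:dim]
    return [[p[a] for a in keep] for p in source]

def mds(source, image, iterations=20, step=50.0):
    delta = pairwise(source)
    current = j_ef(image, delta)
    for _ in range(iterations):
        grad = gradient_j_ef(image, delta)
        while step > 1e-9:                     # assumed safeguard: halve until J drops
            trial = [[y - step * g for y, g in zip(yk, gk)]
                     for yk, gk in zip(image, grad)]
            value = j_ef(trial, delta)
            if value < current:
                image, current = trial, value
                break
            step *= 0.5
    return image, current

r2 = math.sqrt(2.0)
source = [(math.cos(k / r2), math.sin(k / r2), k / r2) for k in range(30)]
start = largest_variance_coordinates(source, 2)
image, final_value = mds(source, start)
```

Because the criterion depends only on interpoint distances, any rotation, translation, or reflection of `image` gives the same value of `final_value`.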


Figure 10.26: Thirty points of the form $(\cos(k/\sqrt{2}), \sin(k/\sqrt{2}), k/\sqrt{2})^t$ for k = 
0, 1, ..., 29 are shown at the left. Multidimensional scaling using the $J_{ef}$ criterion 
(Eq. 105) and a two-dimensional target space leads to the image points shown at the 
right. This lower-dimensional representation shows clearly the fundamental sequential 
nature of the points in the original, source space. 


In non-metric multidimensional scaling problems, the quantities $\delta_{ij}$ are 
dissimilarities whose numerical values are not as important as their rank order. An ideal 
configuration would be one for which the rank order of the distances $d_{ij}$ is the same as 
the rank order of the dissimilarities $\delta_{ij}$. Let us order the m = n(n − 1)/2 dissimilarities 
so that $\delta_{i_1 j_1} \le \cdots \le \delta_{i_m j_m}$, and let $\hat{d}_{ij}$ be any m numbers satisfying the monotonicity 
constraint 


$$\hat{d}_{i_1 j_1} \le \hat{d}_{i_2 j_2} \le \cdots \le \hat{d}_{i_m j_m}. \tag{106}$$


In general, the distances $d_{ij}$ will not satisfy this constraint, and the numbers $\hat{d}_{ij}$ 
will not be distances. However, the degree to which the $d_{ij}$ satisfy this constraint is 
measured by 


$$\hat{J}_{mon} = \min_{\hat{d}_{ij}} \sum_{i<j} (d_{ij} - \hat{d}_{ij})^2, \tag{107}$$

where it is always to be understood that the $\hat{d}_{ij}$ must satisfy the monotonicity 
constraint. Thus, $\hat{J}_{mon}$ measures the degree to which the configuration of points 
$\mathbf{y}_1, \ldots, \mathbf{y}_n$ represents the original data. Unfortunately, $\hat{J}_{mon}$ cannot be used to define 
an optimal configuration because it can be made to vanish by collapsing the 
configuration to a single point. However, this defect is easily removed by a normalization 
such as the following: 


$$J_{mon} = \frac{\hat{J}_{mon}}{\sum_{i<j} d_{ij}^2}. \tag{108}$$

Thus, Jmon is invariant to translations, rotations, and dilations of the configura- 
tion, and an optimal configuration can be defined as one that minimizes this criterion 
function. It has been observed experimentally that when the number of points is 
larger than the dimensionality of the image space, the monotonicity constraint is actually 
quite confining. This might be expected from the fact that the number of constraints 
grows as the square of the number of points, and it is the basis for the frequently 
encountered statement that this procedure allows the recovery of metric information 
from nonmetric data. The quality of the representation generally improves as the 
dimensionality of the image space is increased, and it may be necessary to go beyond 
three dimensions to obtain an acceptably small value of Jmon. However, this may be 
a small price to pay to allow the use of the many clustering procedures available for 
data points in metric spaces (Problem ??). 
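
The minimization in Eq. 107 is a least-squares fit under an ordering constraint, which can be solved by the pool-adjacent-violators procedure (a standard technique for monotone regression, named here because the text does not say how the minimum is found). The sketch below computes the fitted numbers and the normalized criterion of Eq. 108; the function names and the four-pair example are assumptions made for illustration.

```python
def monotone_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to values."""
    blocks = []                                # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the previous block's mean exceeds the last block's mean
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def j_mon(dissimilarities, distances):
    """Eq. 107 normalized as in Eq. 108; both lists refer to the same pairs i < j."""
    order = sorted(range(len(distances)), key=lambda m: dissimilarities[m])
    ranked = [distances[m] for m in order]     # distances in dissimilarity rank order
    fitted = monotone_fit(ranked)
    j_hat = sum((d - f) ** 2 for d, f in zip(ranked, fitted))
    return j_hat / sum(d ** 2 for d in distances)

# Hypothetical example with m = 4 pairs: the middle two distances are out of order.
fit = monotone_fit([1.0, 3.0, 2.0, 4.0])
score = j_mon([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
```

A configuration whose distances already respect the rank order of the dissimilarities yields a criterion value of zero.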


10.14.1 Self-organizing feature maps 


A method closely related to multidimensional scaling is that of self-organizing fea- 
ture maps, sometimes called topologically ordered maps or Kohonen self-organizing 
feature maps. As before, the goal is to represent all points in the source space by 
points in a target space, such that distance and proximity relationships are preserved 
as much as possible. The self-organizing map algorithm we shall discuss does not 
require the storage of a large number of samples, and thus has much lower space 
complexity than multidimensional scaling. (In practice, both methods have high time 
complexities.) Moreover, the method is particularly useful when there is a nonlinear 
mapping inherent in the problem itself, as we shall see. 

It is simplest to explain self-organizing maps by means of an example. Suppose 
we seek to learn a mapping from a circular disk region (the source space) to a target 
space, as shown in Fig. 10.27. The source space is sensed by a movable two-joint 
arm of fixed segment lengths; thus each point $(x_1, x_2)$ in the disk area leads to a pair 
of angles $(\phi_1, \phi_2)$, which we denote as a vector $\boldsymbol{\phi}$. The algorithm uses a sequence 
of $\boldsymbol{\phi}$ values but not the $(x_1, x_2)$ values themselves, since they and their nonlinear 
transformation are not directly accessible. In our illustration the nonlinearity involves 
inverse trigonometric functions, but in most applications it is more complicated and 
not even known. 

The task is this: given a sequence of $\boldsymbol{\phi}$’s (corresponding to points sampled in the 
source space), create a mapping from $\boldsymbol{\phi}$ to y such that points neighboring in the 
source space are mapped to points that are neighboring in the target space. It is 
this goal of preserving neighborhoods that gives the resulting “topologically ordered 
maps” their name. 

[Figure 10.27 diagram: the disk source space, the target space line, and the window function Λ(|y* − y|).] 

Figure 10.27: A self-organizing map from the (two-dimensional) disk source space to 
the (one-dimensional) line of the target space can be learned as follows. For each 
point y in the target line, there exists a corresponding point in the source space that, 
if sensed, would lead to y being most active. For clarity, then, we can link these 
points in the source; it is as if the image line is placed in the source space. At the 
state shown, the particular sensed point leads to $y^*$ being most active. The learning 
rule (Eq. 109) makes its source point move toward the sensed point, as shown by the 
small arrow. Because of the window function $\Lambda(|y^* - y|)$, points adjacent to $y^*$ are 
also moved toward the sensed point, though not as much. If such learning is repeated 
many times as the arm randomly senses the whole source space, a topologically correct 
map is learned. 


The mapping is learned by a simple two-layer neural network, here with two inputs 
($\phi_1$ and $\phi_2$), fully connected to a large number of outputs, corresponding to points 
along the target line. When a pattern $\boldsymbol{\phi}$ is presented, each node in the target space computes its 
net activation, $net_k = \sum_i \phi_i w_{ki}$. One of the units is most activated; call it $y^*$. The 
weights to this unit and those in its immediate neighborhood are updated according 
to: 


$$w_{ki}(t+1) = w_{ki}(t) + \eta(t) \Lambda(|y - y^*|) \phi_i, \tag{109}$$


where $\eta(t)$ is a learning rate which depends upon the iteration number t. Next, 
every weight vector is normalized such that $|\mathbf{w}| = 1$. (Naturally, only those weight 
vectors that have been altered during the learning trial need be re-normalized.) The 
function $\Lambda(|y - y^*|)$ is called the “window function,” and has value 1.0 for $y = y^*$ and 
smaller for large values of $|y - y^*|$. The window function is vital to the success of the 
algorithm: it ensures that neighboring points in the target space have weights that 
are similar, and thus correspond to neighboring points in the source space, thereby 
ensuring topological neighborhoods (Fig. 10.28). The learning rate $\eta(t)$ decreases 
slowly as a function of iteration number (i.e., as patterns are presented) to ensure 
that learning will ultimately stop. 

Equation 109 has a particularly straightforward interpretation. For each pattern 
presentation, the “winning” unit in the target space (y*) is adjusted so that it is more 
like the particular pattern. Others in the neighborhood of y* are also adjusted so that 
their weights more nearly match that of the input pattern (though not quite as much 
as for y*, according to the window function). In this way, neighboring points in the 
input space lead to neighboring points being active. 
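
A single pattern presentation under Eq. 109, including the renormalization of the weight vectors, can be sketched as below for a one-dimensional target space. The Gaussian shape of the window function, its `width` parameter, and the three-unit example are assumptions made for illustration; the text requires only that the window equal 1.0 at the winning unit and fall off with distance. Because a Gaussian window is never exactly zero, every unit is adjusted slightly here.

```python
import math

def som_step(weights, phi, eta, width):
    """Present pattern phi once: find the most active unit, then apply Eq. 109.

    weights[k] is the weight vector of target unit k (units lie along a line).
    """
    net = [sum(w_i * p_i for w_i, p_i in zip(w, phi)) for w in weights]
    winner = max(range(len(weights)), key=lambda k: net[k])
    for k, w in enumerate(weights):
        # assumed Gaussian window: 1.0 at the winner, smaller for distant units
        window = math.exp(-((k - winner) ** 2) / (2.0 * width ** 2))
        updated = [w_i + eta * window * p_i for w_i, p_i in zip(w, phi)]
        norm = math.sqrt(sum(u * u for u in updated))
        weights[k] = [u / norm for u in updated]       # renormalize to |w| = 1
    return winner

# Hypothetical three-unit map and a single input pattern.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
phi = [0.6, 0.8]
winner = som_step(weights, phi, eta=0.5, width=1.0)
activation_after = sum(w_i * p_i for w_i, p_i in zip(weights[winner], phi))
```

In a full training run this step would be repeated over many sampled patterns while `eta` (and often `width`) is slowly decreased.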

Figure 10.28: Typical window functions for self-organizing maps for target spaces in 
one dimension (left) and two dimensions (right). In each case, the weights at the 
maximally active unit, $y^*$, in the target space get the largest weight update while 
units more distant get smaller updates. 


After a large number of pattern presentations, learning according to Eq. 109 
ensures that neighboring points in the source space lead to neighboring points in the 
target space. Informally speaking, it is as if the target space line has been placed on 
the source space, and learning pulls and stretches the line to fill the source space. 
Figure 10.29 shows the development of the map. After 150000 training 
presentations, a topological map has been learned. 


[Figure 10.29 panels, labeled by number of pattern presentations: 0, 20, 100, 1000, 10000, 25000, 50000, 75000, 100000, 150000.] 


Figure 10.29: If a large number of pattern presentations are made using the setup of 
Fig. 10.27, a topologically ordered map develops. The number of pattern presentations 
is listed. 


The learning of such self-organizing maps is very general, and can be applied 
to virtually any source space, target space and continuous nonlinear mapping. Fig- 
ure 10.30 shows the development of a self-organizing map from a square source space 
to a square (grid) target space. 

[Figure 10.30 panels, labeled by number of pattern presentations: 100, 1000, 50000, 300000.] 

Figure 10.30: A self-organizing feature map from a square source space to a square 
(grid) target space. As in Fig. 10.27, each grid point of the target space is shown atop 
the point in the source space that maximally excites that target point. This 
example also used the non-linear 


There are generally inherent ambiguities in the maps learned by this algorithm. 
For instance, a mapping from a square to a square could have eight possible orientations, 
corresponding to the four rotation and two flip symmetries. Such ambiguity is 
generally irrelevant for subsequent clustering or classification in the target space. 
Nevertheless the mapping ambiguities are related to a more significant drawback: the 
possibility of “kinks” in the map. A particular initial condition can lead to part of 
the map learning one of the orientations, while a different part learns another one 
(Fig. 10.31). When this occurs, it is generally best to re-initialize the weights 
randomly and restart the learning with perhaps a wider window function or slower decay 
in the learning rate. 


[Figure 10.31 panels, labeled by number of pattern presentations: 0, 1000, 400000.] 


Figure 10.31: Some initial (random) weights and the particular sequence of patterns 
(randomly chosen) lead to kinks in the map; even extensive further training does 
not eliminate the kink. In such cases, learning should be re-started with randomized 
weights and possibly a wider window function and slower decay in learning. 


One of the benefits of this learning algorithm is that it naturally takes account 
of the probability of sampling in the source space, i.e., p(x). Regions of high 
probability attract more of the points in the target space, and this yields xxx, as 
shown in Fig. 10.32. Thus in the target space, xxx points are spread apart — just as 
we would want for preprocessing for subsequent classification. 

Another issue is the number of dimensions in the target space. One typically 
chooses this dimension (and 

run in unsupervised mode — track slow changes. 

[Figure 10.32 panels, labeled by number of pattern presentations: 0, 1000, 400000, 800000.] 

Figure 10.32: Uneven density: 20 times more likely to choose a point in the center 
(density is 20 times greater). 


Such self-organizing feature maps can be used in a number of systems. For 
instance, one can take a fairly large number (e.g., 12) of temporal frequency filter 
outputs and use their output to map to a two-dimensional target space. When such 
an approach is applied to spoken vowel sounds, similar utterances such as /ee/ and 
/eh/ will be close together, while others, e.g., /ee/ and /oo/, will be far apart — 
just as we had in multidimensional scaling. Subsequent supervised learning can label 
regions in this target space, and thus lead to a full classifier, but one formed using 
only a small amount of supervised training. 


10.14.2 Clustering and Dimensionality Reduction 


Because the curse of dimensionality plagues so many pattern recognition procedures, 
a variety of methods for dimensionality reduction have been proposed. Unlike the 
procedures that we have just examined, most of these methods provide a functional 
mapping, so that one can determine the image of an arbitrary feature vector. The 
classical procedures of statistics are principal components analysis and factor analysis, 
both of which reduce dimensionality by forming linear combinations of the features. 
The object of principal components analysis (known in communication theory as the 
Karhunen-Loève expansion) is to find a lower-dimensional representation that 
accounts for the variance of the features. The object of factor analysis is to find a 
lower-dimensional representation that accounts for the correlations among the 
features. If we think of the problem as one of removing or combining (i.e., grouping) 
highly correlated features, then it becomes clear that the techniques of clustering are 
applicable to this problem. In terms of the data matrix, whose n rows are the 
d-dimensional samples, ordinary clustering can be thought of as a grouping of the rows, 
with a smaller number of cluster centers being used to represent the data, whereas di- 
mensionality reduction can be thought of as a grouping of the columns, with combined 
features being used to represent the data. 

Let us consider a simple modification of hierarchical clustering to reduce 
dimensionality. In place of an n-by-n matrix of distances between samples, we consider a 
d-by-d correlation matrix $\mathbf{R} = [\rho_{ij}]$, where the correlation coefficient $\rho_{ij}$ is related to 
the covariances (or sample covariances) by 


$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}}. \tag{110}$$


Since $0 \le \rho_{ij}^2 \le 1$, with $\rho_{ij}^2 = 0$ for uncorrelated features and $\rho_{ij}^2 = 1$ for completely 
correlated features, $\rho_{ij}^2$ plays the role of a similarity function for features. Two features 
for which $\rho_{ij}^2$ is large are clearly good candidates to be merged into one feature, thereby 
reducing the dimensionality by one. Repetition of this process leads to the following 
hierarchical procedure: 


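Algorithm 8 (Hierarchical dimensionality reduction) 


1 begin initialize $d'$, $\mathcal{D}_i \leftarrow \{x_i\}$, i = 1, ..., d 
2     $\hat{d} \leftarrow d$ 
3     do $\hat{d} \leftarrow \hat{d} - 1$ 
4         compute $\mathbf{R}$ by Eq. 110 
5         find most correlated distinct clusters, say $\mathcal{D}_i$ and $\mathcal{D}_j$ 
6         $\mathcal{D}_i \leftarrow \mathcal{D}_i \cup \mathcal{D}_j$ (merge) 
7         delete $\mathcal{D}_j$ 
8     until $\hat{d} = d'$ 
9     return $d'$ clusters 
10 end 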


Probably the simplest way to merge two groups of features is just to average them. 
(This tacitly assumes that the features have been scaled so that their numerical ranges 
are comparable.) With this definition of a new feature, there is no problem in defining 
the correlation matrix for groups of features. It is not hard to think of variations on 
this general theme, but we shall not pursue this topic further. 
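
A direct rendering of this procedure, with merging by averaging, might look as follows. It is a sketch under stated assumptions: the merged feature is taken to be the average of all original features in the group (one reading of "just average them"), the squared correlation is computed from sample covariances, no feature is assumed to be constant, and the function names and the five-sample data matrix are made up for the example.

```python
def squared_correlation(u, v):
    """Squared correlation coefficient of two feature columns (cf. Eq. 110)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    s_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    s_uu = sum((a - mu) ** 2 for a in u)
    s_vv = sum((b - mv) ** 2 for b in v)
    return s_uv ** 2 / (s_uu * s_vv)

def reduce_features(data, d_prime):
    """Merge the most correlated feature groups until d_prime groups remain."""
    d = len(data[0])
    groups = [[i] for i in range(d)]                     # D_i <- {x_i}
    columns = [[row[i] for row in data] for i in range(d)]
    while len(groups) > d_prime:
        pairs = [(i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups))]
        i, j = max(pairs,
                   key=lambda p: squared_correlation(columns[p[0]], columns[p[1]]))
        groups[i] = groups[i] + groups[j]                # D_i <- D_i U D_j
        # merged feature: average of the original features in the group
        columns[i] = [sum(row[f] for f in groups[i]) / len(groups[i])
                      for row in data]
        del groups[j], columns[j]                        # delete D_j
    return groups, columns

# Hypothetical data matrix: n = 5 samples (rows), d = 3 features (columns).
# Features 0 and 1 are nearly proportional; feature 2 is largely unrelated.
data = [(0.0, 0.1, 1.0),
        (1.0, 2.0, -1.0),
        (2.0, 4.1, 0.0),
        (3.0, 5.9, 1.0),
        (4.0, 8.0, -1.0)]
groups, columns = reduce_features(data, d_prime=2)
```

As the text notes, averaging is meaningful only if the features have first been scaled to comparable numerical ranges, which this example does not attempt.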

For the purposes of pattern classification, the most serious criticism of all of the 
approaches to dimensionality reduction that we have mentioned is that they are overly 
concerned with faithful representation of the data. Greatest emphasis is usually placed 
on those features or groups of features that have the greatest variability. But for 
classification, we are interested in discrimination — not representation. While it is a 
truism that the ideal representation is the one that makes classification easy, it is not 
always so clear that clustering without explicitly incorporating classification criteria 
will find such a representation. Roughly speaking, the most interesting features are 
the ones for which the difference in the class means is large relative to the standard 
deviations, not the ones for which merely the standard deviations are large. In short, 
we are interested in something more like the method of multiple discriminant analysis 
described in Sect. ??. 

There is a large body of theory on methods of dimensionality reduction for pattern 
classification. Some of these methods seek to form new features out of linear combi- 
nations of old ones. Others seek merely a smaller subset of the original features. A 
major problem confronting this theory is that the division of pattern recognition into 
feature extraction followed by classification is theoretically artificial. A completely 
optimal feature extractor can never be anything but an optimal classifier. It is only 
when constraints are placed on the classifier or limitations are placed on the size of 
the set of samples that one can formulate nontrivial (or very complicated) problems. 
Various ways of circumventing this problem that may be useful under the proper cir- 
cumstances can be found in the literature. When it is possible to exploit knowledge 
of the problem domain to obtain more informative features, that is usually the most 
profitable course of action. 


Summary 


Unsupervised learning and clustering seek to extract information from unlabeled
samples. If the underlying distribution comes from a mixture of component densities
described by a set of unknown parameters θ, then θ can be estimated by Bayesian or
maximum-likelihood methods. A more general approach is to define some measure of 
similarity between two clusters, as well as a global criterion such as a sum-squared- 
error or trace of a scatter matrix. Since there are only occasionally analytic methods 
for computing the clustering which optimizes the criterion, a number of greedy (lo- 
cally step-wise optimal) iterative algorithms can be used, such as k-means and fuzzy 
k-means clustering. 
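
As a reminder of how such a greedy iteration proceeds, here is a minimal sketch of basic k-means. It is not a transcription of the chapter's algorithm listing: the function name `kmeans`, the stopping rule (labels unchanged), and the choice to leave the mean of an empty cluster untouched are our assumptions.

```python
def kmeans(samples, means, max_iter=100):
    """Basic k-means. Returns (means, labels, iterations used)."""
    means = [list(m) for m in means]
    labels = None
    for it in range(1, max_iter + 1):
        # Assignment step: label each sample with its nearest mean.
        new_labels = [
            min(range(len(means)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, means[j])))
            for x in samples
        ]
        if new_labels == labels:        # no sample changed cluster: converged
            return means, labels, it
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for j in range(len(means)):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:                 # an empty cluster keeps its old mean
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return means, labels, max_iter

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, n_iter = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
print(centers, labels, n_iter)
```

Each step can only lower the sum-squared-error criterion, so the procedure converges, but only to a local optimum that depends on the initial means.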

If we seek to reveal structure in the data at many levels (i.e., clusters with
subclusters and sub-subclusters), then hierarchical methods are needed. Agglomerative
or bottom-up methods start with each sample as a singleton cluster and iteratively 
merge clusters that are “most similar” according to some chosen similarity or
distance measure. Conversely, divisive or top-down methods start with a single cluster
representing the full data set and iteratively split it into smaller clusters, each time
seeking the subclusters that are most dissimilar. The resulting hierarchical structure
is revealed in a dendrogram. A large disparity in the similarity measure for successive 
cluster levels in a dendrogram usually indicates the “natural” number of clusters. Al- 
ternatively, the problem of cluster validity — knowing the proper number of clusters 
— can also be addressed by hypothesis testing. In that case the null hypothesis is 
that there are some number c of clusters; we then determine if the reduction of the 
cluster criterion due to an additional cluster is statistically significant. 
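
The dendrogram heuristic can be made concrete with a short sketch. We assume one-dimensional data, the nearest-neighbor distance d_min between clusters, and at least three points; the function names and the illustrative data are ours. The "natural" number of clusters is read off as the number present just before the largest jump in merge distance.

```python
def single_linkage_merges(points):
    """Agglomerative (bottom-up) clustering of 1-D points using d_min.

    Returns the merge distances in the order the merges occur.
    """
    clusters = [[p] for p in points]
    merge_dists = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(p - q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters[b])   # merge the two closest clusters
        del clusters[b]
        merge_dists.append(d)
    return merge_dists

def natural_cluster_count(merge_dists):
    """Number of clusters present just before the largest jump in merge distance."""
    n = len(merge_dists) + 1
    gaps = [merge_dists[i + 1] - merge_dists[i] for i in range(len(merge_dists) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return n - (i + 1)   # after merge i (0-based) there are n - (i + 1) clusters

data = [0.0, 0.5, 1.1, 8.0, 8.4, 9.1]
dists = single_linkage_merges(data)
print(dists, natural_cluster_count(dists))
```

This exhaustive search over cluster pairs is chosen for clarity, not efficiency.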

Competitive learning is an on-line neural network clustering algorithm in which 
the cluster center most similar to an input pattern is modified to become more like 
that pattern. In order to guarantee that learning stops for an arbitrary data set, 
the learning rate must decay. Competitive learning can be modified to allow for 
the creation of new cluster centers, if no center is sufficiently similar to a particular 
input pattern, as in leader-follower clustering and Adaptive Resonance. While these 
methods have many advantages, such as computational ease and tracking gradual 
variations in the data, they rarely optimize an easily specified global criterion such as 
sum-of-squared error. 
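
A minimal sketch of the on-line update with a decaying learning rate follows. It is a simplified Euclidean variant rather than the normalized-weight network form: the winner is the nearest center, and the names `competitive_learning`, `eta0`, and `decay`, as well as the geometric decay schedule, are our assumptions.

```python
import random

def competitive_learning(samples, centers, eta0=0.5, decay=0.9, epochs=20, seed=0):
    """On-line competitive learning: move only the winning center toward each pattern."""
    rng = random.Random(seed)
    centers = [list(c) for c in centers]
    eta = eta0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)              # present patterns in random order
        for x in order:
            # Winner: the center closest (most similar) to the pattern.
            j = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
            centers[j] = [c + eta * (a - c) for a, c in zip(x, centers[j])]
        eta *= decay                    # decaying rate so that updates die out
    return centers

pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
print(competitive_learning(pts, [(1.0, 1.0), (4.0, 4.0)]))
```

With a constant rate the centers would keep jittering as new patterns arrive; the decay trades the ability to track drifting data for eventual stability.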

Graph theoretic methods in clustering treat the data as points, to be linked ac- 
cording to a number of heuristics and distance measures. The clusters produced by 
these methods can exhibit chaining or other intricate structures, and rarely optimize 
an easily specified global cost function. Graph methods are, moreover, generally more 
sensitive to details of the data. 

Component analysis seeks to find directions or axes in feature space that provide 
an improved, lower-dimensional representation for the full data space. In (linear) 
principal component analysis, such directions are merely the eigenvectors of
the covariance matrix of the full data having the largest eigenvalues; this optimizes
a sum-squared-error criterion. Nonlinear principal components, for instance as learned
in an internal layer of an auto-encoder neural network, yield curved surfaces embedded
in the full d-dimensional
feature space, onto which an arbitrary pattern x is projected. The goal in independent 
component analysis — which uses gradient descent in an entropy criterion — is to 
determine the directions in feature space that are statistically most independent. 
Such directions may reveal the true sources (assumed independent) and can be used 
for segmentation and blind source separation. 
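
For the linear case, the leading principal direction can be found without any library support by power iteration on the sample covariance matrix. The sketch below makes several assumptions of ours: the function names, the fixed iteration count, the arbitrary starting vector, and a covariance that is not identically zero.

```python
def covariance_matrix(samples):
    """Sample covariance (normalized by n) of a list of d-dimensional points."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[i] for x in samples) / n for i in range(d)]
    centered = [[x[i] - mean[i] for i in range(d)] for x in samples]
    return [[sum(row[i] * row[j] for row in centered) / n for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, iters=500):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]       # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    # Rayleigh quotient v^t M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

# Points scattered closely about the line x2 = x1.
data = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
lam, e1 = leading_eigenvector(covariance_matrix(data))
print(lam, e1)
```

Projecting each centered sample onto `e1` gives the one-dimensional representation with the smallest sum-squared reconstruction error; further components follow by deflating the matrix and repeating.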

Two general methods for dimensionality reduction are self-organizing feature maps
and multidimensional scaling. Self-organizing feature maps can be highly nonlinear,
and represent points close in the source space by points close in the lower-dimensional
target space. In preserving neighborhoods in this way, such maps are also called
“topologically correct.” The source and target spaces can be of very general shapes, and the
mapping will depend upon the distribution of samples within the source space.
Multidimensional scaling similarly learns a nonlinear mapping that, too, seeks to 
preserve neighborhoods, and is often used for data visualization. Because the basic 
method requires all the inter-point distances for minimizing a global criterion function, 
its space complexity limits the usefulness of multidimensional scaling to problems of 
moderate size. 


Bibliographical and Historical Remarks 


Historically, the literature on unsupervised learning and clustering dates to Karl
Pearson, who in 1894 used sample moments to determine the parameters in a mixture of
two univariate Gaussians. While most books on pattern classification address un- 
supervised learning, there are several modern books[21, 1] and review articles on 
unsupervised learning that go into great detail. Much of the work on unsupervised 
methods comes from the signal compression community, where vector quantization 
(VQ) seeks to represent an arbitrary vector by one of c prototype vectors
corresponding to our clusters [17]. 

A clear book on mixture models is [29]. The issue of identifiability in unsupervised
learning is addressed in [37]. Hasselblad showed how the parameters of one-dimensional
normals could be learned in an unsupervised environment [19]. The k-means algorithm was
introduced in a paper by Lloyd [28], which inspired many variations (including “fuzzy”
ones [4, 5]) and computational improvements.

Efficient agglomerative methods for hierarchical clustering are summarized in [10]. 

The key mathematical concepts underlying principal component analysis appear 
in [22] as well as [7, 26, 11], which stress neural implementation. Independent
component analysis was introduced by Jutten and Herault [23], and the maximum-likelihood
approach was introduced by Gaeta and Lacoume [15]. Generalizations and a
maximum-likelihood approach are given in [32]. Bell and Sejnowski [3] showed a neural
network implementation. A good compendium is [38]; see also the Perlmutter paper [31].
Several studies have
shown the benefits of ICA for classification [13]. 

Multidimensional scaling is discussed in [34, 6] and its relationship to clustering is
explored in [27]. 

The classificatory foundations of biology, cladistics (from the Greek klados, branch) 
provide useful background for the use of classification in all scientific fields [14]. 

Kohonen’s long series of papers on self-organizing feature maps began in the early 
1980s [24] and a good compendium can be found in [25]; convergence properties 
of algorithms for self-organizing feature maps are proved in [39]. There have been 
numerous applications of the method, from speech to finding patterns of poverty in 
the world. 

Also goes under the name Learning Vector Quantization (LVQ). 

The main emphasis of research on Adaptive Resonance has been to explore [8, 
Chapter 10] A wonderfully clear exposition of the central algorithmic ideas is [30]; 
an attempt to translate the ideas and terminology of adaptive resonance, including a 
glossary, is given in [36]. 


Problems 

1. Suppose that x can assume the values 0, 1, ..., m and that $P(x|\boldsymbol{\theta})$ is a mixture of
c binomial distributions

$$P(x|\boldsymbol{\theta}) = \sum_{j=1}^{c} \binom{m}{x} \theta_j^{x} (1 - \theta_j)^{m-x} P(\omega_j),$$

where $\boldsymbol{\theta}$ is a vector of length c representing the parameters in the distributions.


(a) Assuming that the prior probabilities $P(\omega_j)$ are known, explain why this mixture
is not identifiable if m < c.


(b) Under these conditions, is the mixture completely unidentifiable? 
(c) How do your answers above change if the prior probabilities are also unknown? 


2. Consider a mixture distribution of two triangle distributions, where component
density $\omega_i$ is centered on $\mu_i$ and has “halfwidth” $w_i$, according to:

$$p(x|\omega_i) \sim T(\mu_i, w_i) = \begin{cases} (w_i - |x - \mu_i|)/w_i^2 & \text{for } |x - \mu_i| < w_i \\ 0 & \text{otherwise.} \end{cases}$$


(a) Assume $P(\omega_1) = P(\omega_2) = 0.5$ and derive the equations for the maximum-likelihood
values $\hat{\mu}_i$ and $\hat{w}_i$, i = 1, 2.


(b) Under the conditions in part (a), is the distribution identifiable? 


(c) Assume that both widths $w_i$ are known, but the centers are not. Assume, too,
that there exist values for the centers that give non-zero probability to each of 
the samples. Derive a formula for the maximum-likelihood value of the centers. 


(d) Under the conditions in part (c), is the distribution identifiable? 


3. Suppose there is a one-dimensional mixture density consisting of two Gaussian 
components, each centered on the origin: 


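$$p(x|\boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}\,\sigma_1} e^{-x^2/(2\sigma_1^2)} + \frac{1 - P(\omega_1)}{\sqrt{2\pi}\,\sigma_2} e^{-x^2/(2\sigma_2^2)},$$

and $\boldsymbol{\theta} = (P(\omega_1), \sigma_1, \sigma_2)$ describes the parameters.

(a) Show that under these conditions this density is completely unidentifiable.

(b) Suppose the value $P(\omega_1)$ is fixed and known. Is the model identifiable?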


(c) Suppose $\sigma_1$ and $\sigma_2$ are known, but $P(\omega_1)$ is unknown. Is this resulting model
identifiable? That is, can $P(\omega_1)$ be identified using data?
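
4. Let $\mathbf{x}$ be a d-component binary vector (0, 1) and $P(\mathbf{x}|\boldsymbol{\theta})$ be a mixture of c
multivariate Bernoulli distributions,

$$P(\mathbf{x}|\boldsymbol{\theta}) = \sum_{i=1}^{c} P(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) P(\omega_i),$$

where

$$P(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) = \prod_{j=1}^{d} \theta_{ij}^{x_j} (1 - \theta_{ij})^{1 - x_j}.$$

(a) Derive the formula for the partial derivative:

$$\frac{\partial \ln P(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)}{\partial \theta_{ij}} = \frac{x_j - \theta_{ij}}{\theta_{ij}(1 - \theta_{ij})}.$$

(b) Using the general equations for maximum-likelihood estimates, show that the
maximum-likelihood estimate $\hat{\boldsymbol{\theta}}_i$ for $\boldsymbol{\theta}_i$ must satisfy

$$\hat{\boldsymbol{\theta}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}_i)\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}_i)}.$$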


(c) Interpret your answer to part (b) in words. 


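5. Let $p(\mathbf{x}|\boldsymbol{\theta})$ be a c-component normal mixture with $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) \sim N(\boldsymbol{\mu}_i, \sigma_i^2 \mathbf{I})$. Using
the results of Sect. ??, show that the maximum-likelihood estimate for $\sigma_i^2$ must satisfy

$$\hat{\sigma}_i^2 = \frac{\frac{1}{d} \sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}_i)\, \|\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i\|^2}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}_i)},$$

where $\hat{\boldsymbol{\mu}}_i$ and $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}_i)$ are given by Eqs. 20 & 22, respectively.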

6. The derivation of the equations for maximum-likelihood estimation of parameters 
of a mixture density was made under the assumption that the parameters in each 
component density are functionally independent. Suppose instead that 


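$$p(\mathbf{x}|\alpha) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \alpha) P(\omega_j),$$

where $\alpha$ is a parameter that appears in several (and possibly all) of the component
densities. Let l be the n-sample log-likelihood function, and show that

$$\frac{\partial l}{\partial \alpha} = \sum_{k=1}^{n} \sum_{j=1}^{c} P(\omega_j|\mathbf{x}_k, \alpha) \frac{\partial \ln p(\mathbf{x}_k|\omega_j, \alpha)}{\partial \alpha},$$

where

$$P(\omega_j|\mathbf{x}_k, \alpha) = \frac{p(\mathbf{x}_k|\omega_j, \alpha) P(\omega_j)}{p(\mathbf{x}_k|\alpha)}.$$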


7. Let $\theta_1$ and $\theta_2$ be unknown parameters for the component densities $p(x|\omega_1, \theta_1)$ and
$p(x|\omega_2, \theta_2)$, respectively. Assume that $\theta_1$ and $\theta_2$ are initially statistically independent,
so that $p(\theta_1, \theta_2) = p_1(\theta_1) p_2(\theta_2)$.

(a) Show that after one sample $x_1$ from the mixture density is observed, $p(\theta_1, \theta_2|x_1)$
can no longer be factored as

$$p_1(\theta_1|x_1)\, p_2(\theta_2|x_1)$$

if

$$\frac{\partial p(x|\omega_i, \theta_i)}{\partial \theta_i} \neq 0, \quad i = 1, 2.$$


(b) What does this imply in general about the statistical dependence of parameters 
in unsupervised learning? 


8. Assume that a mixture density $p(\mathbf{x}|\boldsymbol{\theta})$ is identifiable. Prove that under very
general conditions $p(\boldsymbol{\theta}|\mathcal{D}^n)$ converges (in probability) to a Dirac delta function
centered at the true value of $\boldsymbol{\theta}$ as the number of samples becomes very large.

9. Assume the likelihood function of Eq. 3 is differentiable and derive the maximum 
likelihood conditions of Eqs. 11 — 13. 
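
10. Let $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is a common covariance matrix for the c
component densities. Let $\sigma_{pq}$ be the pqth element of $\boldsymbol{\Sigma}$, $\sigma^{pq}$ be the pqth element of
$\boldsymbol{\Sigma}^{-1}$, $x_p(k)$ be the pth element of $\mathbf{x}_k$, and $\mu_p(i)$ be the pth element of $\boldsymbol{\mu}_i$.

(a) Show that

$$\frac{\partial \ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i)}{\partial \sigma^{pq}} = \left(1 - \frac{\delta_{pq}}{2}\right) \left[\sigma_{pq} - (x_p(k) - \mu_p(i))(x_q(k) - \mu_q(i))\right],$$

where

$$\delta_{pq} = \begin{cases} 1 & \text{if } p = q \\ 0 & \text{if } p \neq q. \end{cases}$$

(b) Use this result and the results of Problem 6 to show that the maximum-likelihood
estimate for $\boldsymbol{\Sigma}$ must satisfy

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \mathbf{x}_k^t - \sum_{i=1}^{c} \hat{P}(\omega_i)\, \hat{\boldsymbol{\mu}}_i \hat{\boldsymbol{\mu}}_i^t,$$

where $\hat{P}(\omega_i)$ and $\hat{\boldsymbol{\mu}}_i$ are the maximum-likelihood estimates given by Eqs. 19 & 20.
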
11. Show that the maximum-likelihood estimate of a prior probability can be zero by 
considering the following special case. Let $p(x|\omega_1) \sim N(0, 1)$ and $p(x|\omega_2) \sim N(0, 1/2)$,
so that $P(\omega_1)$ is the only unknown parameter in the mixture.

(a) Show that the maximum-likelihood estimate $\hat{P}(\omega_1)$ of $P(\omega_1)$ is zero if one sample
$x_1$ is observed and if $x_1^2 < \ln 2$.


(b) What is the value of $\hat{P}(\omega_1)$ if $x_1^2 > \ln 2$?


(c) Summarize and interpret your answer in words. 


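12. Consider the univariate normal mixture

$$p(x|\mu_1, \ldots, \mu_c) = \sum_{j=1}^{c} \frac{P(\omega_j)}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_j}{\sigma}\right)^2\right],$$

in which all of the c components have the same, known, variance $\sigma^2$. Suppose that
the means are so far apart compared to $\sigma$ that for any observed x all but one of the
terms in this sum are negligible. Use a heuristic argument to show that the value of

$$\max_{\mu_1, \ldots, \mu_c} \left\{ \frac{1}{n} \ln p(x_1, \ldots, x_n|\mu_1, \ldots, \mu_c) \right\}$$

ought to be approximately

$$\sum_{j=1}^{c} P(\omega_j) \ln P(\omega_j) - \frac{1}{2} \ln[2\pi\sigma^2 e]$$

when the number n of independently drawn samples is large. (Here e is the base of
the natural logarithms.)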

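13. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be n d-dimensional samples and $\boldsymbol{\Sigma}$ be any non-singular d-by-d
matrix. Show that the vector $\mathbf{x}$ that minimizes

$$\sum_{k=1}^{n} (\mathbf{x}_k - \mathbf{x})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \mathbf{x})$$

is the sample mean, $\bar{\mathbf{x}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$.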

14. Perform the differentiation in Eq. 26 to derive Eqs. 27 & 28. 

15. Show that the computational complexity of Algorithm 1 is O(ndcT), where n
is the number of d-dimensional patterns, c the assumed number of clusters and T the 
number of iterations. 

16. Fill in the steps of the derivation of Eqs. 19 — 21. 

17. Consider the combinatorics of exhaustive inspection of clusters of n samples into
c clusters.


(a) Show that there are exactly

$$\frac{1}{c!} \sum_{i=1}^{c} \binom{c}{i} (-1)^{c-i}\, i^n$$

such distinct clusterings.


(b) How many clusterings are there for n = 100 and c = 5?

(c) Find an approximation for your answer to (a) for the case $n \gg c$. Use your
answer to estimate the number of clusterings of 1000 points into 10 clusters.
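
The count in part (a), read as (1/c!) times the sum over i = 1, ..., c of C(c, i) (−1)^(c−i) i^n, is easy to evaluate exactly with integer arithmetic. The sketch below is only a numerical aid for checking small cases; the function name `num_clusterings` is ours, and `math.comb` requires Python 3.8 or later.

```python
from math import comb, factorial

def num_clusterings(n, c):
    """Number of ways to partition n samples into exactly c non-empty clusters."""
    total = sum((-1) ** (c - i) * comb(c, i) * i ** n for i in range(1, c + 1))
    return total // factorial(c)    # the alternating sum is always divisible by c!

# Small cases that can be checked by hand enumeration.
print(num_clusterings(4, 2), num_clusterings(5, 3))
```

Because Python integers have arbitrary precision, the same function works unchanged for the much larger n of parts (b) and (c), where it can be compared against the leading-term approximation.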

18. Prove that the ranking of distances between samples discussed in Sect. ?? is
invariant to any monotonic transformation of the dissimilarity values. Do this as
follows:


(a) Define the value $v_k$ for the clustering at level k, and for level 1 let $v_1 = 0$. For all
higher levels, $v_k$ is the minimum dissimilarity between pairs of distinct clusters
at level k − 1. Explain why with both $d_{min}$ and $d_{max}$ the value $v_k$ either stays
the same or increases as k increases.


(b) Assume that no two of the n samples are identical, so that $v_2 > 0$. Use this to
prove monotonicity, i.e., that $0 = v_1 < v_2 \le v_3 \le \cdots \le v_n$.

19. Derive Eq. 50 from Eq. 49 using the definition given in Eq. 51.

20. If a set of n samples $\mathcal{D}$ is partitioned into c disjoint subsets $\mathcal{D}_1, \ldots, \mathcal{D}_c$, the
sample mean $\mathbf{m}_i$ for samples in $\mathcal{D}_i$ is undefined if $\mathcal{D}_i$ is empty. In such a case, the
sum of squared errors involves only the non-empty subsets:

$$J_e = \sum_{\mathcal{D}_i \neq \emptyset} \; \sum_{\mathbf{x} \in \mathcal{D}_i} \|\mathbf{x} - \mathbf{m}_i\|^2.$$

Assuming that n > c, show there are no empty subsets in a partition that minimizes
$J_e$. Explain your answer in words.

21. Consider a set of n = 2k + 1 samples, k of which coincide at x = —2, k at x = 0, 
and one at x = a > 0.


(a) Show that the two-cluster partitioning that minimizes $J_e$ groups the k samples
at x = 0 with the one at x = a if $a^2 < 2(k + 1)$.


(b) What is the optimal grouping if $a^2 > 2(k + 1)$?


22. Let x; = (els xə = (1), x3 = (?), and x4 = ee and consider the following three 
partitions: 


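1. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_2\}$, $\mathcal{D}_2 = \{\mathbf{x}_3, \mathbf{x}_4\}$
2. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_4\}$, $\mathcal{D}_2 = \{\mathbf{x}_2, \mathbf{x}_3\}$
3. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$, $\mathcal{D}_2 = \{\mathbf{x}_4\}$


Show that by the sum-of-squared-error $J_e$ criterion (Eq. ??), the third partition is
favored, whereas by the invariant $J_d$ (Eq. 63) criterion the first two partitions are
favored.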

23. Let xı = (E XQ = En, x3 = C and x4 = C and consider the following 
three partitions: 


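1. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_2\}$, $\mathcal{D}_2 = \{\mathbf{x}_3, \mathbf{x}_4\}$
2. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_4\}$, $\mathcal{D}_2 = \{\mathbf{x}_2, \mathbf{x}_3\}$
3. $\mathcal{D}_1 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$, $\mathcal{D}_2 = \{\mathbf{x}_4\}$

(a) Find the clustering that minimizes the sum-of-squared error criterion, $J_e$ (Eq. ??).

(b) Find the clustering that minimizes the trace criterion, $J_d$ (Eq. 63).
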
24. Consider the problem of invariance to transformation of the feature space. 


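(a) Show the eigenvalues $\lambda_1, \ldots, \lambda_d$ of $\mathbf{S}_W^{-1} \mathbf{S}_B$ are invariant to nonsingular linear
transformations of the data.


(b) Show that the eigenvalues $\nu_1, \ldots, \nu_d$ of $\mathbf{S}_T^{-1} \mathbf{S}_W$ are related to those of $\mathbf{S}_W^{-1} \mathbf{S}_B$
by $\nu_i = 1/(1 + \lambda_i)$.


(c) Use your above results to show that $J_d = |\mathbf{S}_W|/|\mathbf{S}_T|$ is invariant to nonsingular
linear transformations of the data.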


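25. Recall the definitions of the within-cluster and the between-cluster scatter
matrices (Eqs. 57 & 58). Define the total scatter matrix to be $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$. Show
that the following measures (Eqs. 65 & 66) are invariant to linear transformations of
the data.


(a) $\mathrm{tr}[\mathbf{S}_T^{-1} \mathbf{S}_W] = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}$


(b) $|\mathbf{S}_W|/|\mathbf{S}_T| = \prod_{i=1}^{d} \frac{1}{1 + \lambda_i}$


(c) $|\mathbf{S}_W^{-1} \mathbf{S}_B| = \prod_{i=1}^{d} \lambda_i$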


(d) What is the typical value of the criterion in (c)? Why, therefore, is that criterion 
not very useful? 


26. Show that the clustering criterion $J_d$ in Eq. 63 is invariant to linear
transformations of the space as follows. Let $\mathbf{T}$ be a nonsingular matrix and consider the change
of variables $\mathbf{x}' = \mathbf{T}\mathbf{x}$.


(a) Write the new mean vectors $\mathbf{m}_i'$ and scatter matrices $\mathbf{S}_i'$ in terms of the old
values and $\mathbf{T}$.


(b) Calculate $J_d'$ in terms of the (old) $J_d$ and show that they differ solely by an
overall scalar factor.


(c) Since this factor is the same for all partitions, argue that $J_d$ and $J_d'$ rank the
partitions in the same way, and hence that the optimal clustering based on $J_d$
is invariant to nonsingular linear transformations of the data.


27. Consider the problems that might arise when using the determinant criterion for 
clustering. 


(a) Show that the rank of the within-cluster scatter matrix $\mathbf{S}_i$ can not exceed $n_i - 1$,
and thus the rank of $\mathbf{S}_W$ can not exceed $\sum_i (n_i - 1) = n - c$.

(b) Use your answer to explain why the between cluster scatter matrix $\mathbf{S}_B$ may
become singular. (Of course, if the samples are confined to a lower dimensional
subspace it is possible to have $\mathbf{S}_W$ be singular even though n − c > d.)

28. One way to generalize the basic minimum-squared-error procedure is to define
the criterion function

$$J_T = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \mathbf{m}_i)^t \mathbf{S}_T^{-1} (\mathbf{x} - \mathbf{m}_i),$$

where $\mathbf{m}_i$ is the mean of the $n_i$ samples in $\mathcal{D}_i$ and $\mathbf{S}_T$ is the total scatter matrix.


(a) Show that $J_T$ is invariant to nonsingular linear transformations of the data.


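(b) Show that the transfer of a sample $\hat{\mathbf{x}}$ from $\mathcal{D}_i$ to $\mathcal{D}_j$ causes $J_T$ to change to

$$J_T^* = J_T + \frac{n_j}{n_j + 1} (\hat{\mathbf{x}} - \mathbf{m}_j)^t \mathbf{S}_T^{-1} (\hat{\mathbf{x}} - \mathbf{m}_j) - \frac{n_i}{n_i - 1} (\hat{\mathbf{x}} - \mathbf{m}_i)^t \mathbf{S}_T^{-1} (\hat{\mathbf{x}} - \mathbf{m}_i).$$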


(c) Using this result, write pseudocode for an iterative procedure for minimizing $J_T$
(cf. Computer Exercise 20).


29. Consider how the transfer of a single point from one cluster to another affects 
the mean and sum-squared error, and thereby derive Eqs. 71 & 72. 

30. Let a similarity measure be defined as $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t \mathbf{x}'/(\|\mathbf{x}\| \, \|\mathbf{x}'\|)$.


(a) Interpret this similarity measure if the d features have binary values, where
$x_i = 1$ if $\mathbf{x}$ possesses the ith feature and $x_i = -1$ if it does not.


(b) Show that for this case the squared Euclidean distance satisfies

$$\|\mathbf{x} - \mathbf{x}'\|^2 = 2d\,(1 - s(\mathbf{x}, \mathbf{x}')).$$


31. Let d be the dimensionality of the space, q a scalar parameter (q > 1). For each 
of the measures shown, state whether it represents a metric (or not), and whether it 
represents an ultrametric (or not). 


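(a) $s(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2$ (squared Euclidean)

(b) $s(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|$ (Euclidean)

(c) $s(\mathbf{x}, \mathbf{x}') = \left(\sum_{k=1}^{d} |x_k - x_k'|^q\right)^{1/q}$ (Minkowski)

(d) $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t \mathbf{x}'/(\|\mathbf{x}\| \, \|\mathbf{x}'\|)$ (cosine)

(e) $s(\mathbf{x}, \mathbf{x}') = \mathbf{x}^t \mathbf{x}'$ (dot product)

(f) $s(\mathbf{x}, \mathbf{x}') = \min_{\boldsymbol{\alpha}} \|\mathbf{x} + \boldsymbol{\alpha} T(\mathbf{x}) - \mathbf{x}'\|^2$ (one-sided tangent distance),
where T is a linear transform and $\boldsymbol{\alpha}$ a vector of coefficients (cf. Chap. ??, Sect.
??).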


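32. Let cluster $\mathcal{D}_i$ contain $n_i$ samples, and let $d_{ij}$ be some measure of the distance
between two clusters $\mathcal{D}_i$ and $\mathcal{D}_j$. In general, one might expect that if $\mathcal{D}_i$ and $\mathcal{D}_j$ are
merged to form a new cluster $\mathcal{D}_k$, then the distance from $\mathcal{D}_k$ to some other cluster
$\mathcal{D}_h$ is not simply related to $d_{hi}$ and $d_{hj}$. However, consider the equation

$$d_{hk} = \alpha_i d_{hi} + \alpha_j d_{hj} + \beta d_{ij} + \gamma |d_{hi} - d_{hj}|.$$

Show that the following choices for the coefficients $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$ lead to the distance
functions indicated.


(a) $d_{min}$: $\alpha_i = \alpha_j = 0.5$, $\beta = 0$, $\gamma = -0.5$.


(b) $d_{max}$: $\alpha_i = \alpha_j = 0.5$, $\beta = 0$, $\gamma = +0.5$.


(c) $d_{avg}$: $\alpha_i = \frac{n_i}{n_i + n_j}$, $\alpha_j = \frac{n_j}{n_i + n_j}$, $\beta = \gamma = 0$.

(d) $d_{mean}^2$: $\alpha_i = \frac{n_i}{n_i + n_j}$, $\alpha_j = \frac{n_j}{n_i + n_j}$, $\beta = -\alpha_i \alpha_j$, $\gamma = 0$.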
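
A quick numerical spot check of the first two cases may be helpful before attempting the proofs. The sketch below reads the recurrence as d_hk = α_i d_hi + α_j d_hj + β d_ij + γ |d_hi − d_hj|; the function name `merged_distance` and the sample distances are ours, and a check on one set of numbers is of course not a proof.

```python
def merged_distance(d_hi, d_hj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from cluster h to the merger of clusters i and j."""
    return alpha_i * d_hi + alpha_j * d_hj + beta * d_ij + gamma * abs(d_hi - d_hj)

d_hi, d_hj, d_ij = 3.0, 5.0, 2.0

# gamma = -0.5 should reproduce the nearest-neighbor distance min(d_hi, d_hj).
nearest = merged_distance(d_hi, d_hj, d_ij, 0.5, 0.5, 0.0, -0.5)

# gamma = +0.5 should reproduce the farthest-neighbor distance max(d_hi, d_hj).
farthest = merged_distance(d_hi, d_hj, d_ij, 0.5, 0.5, 0.0, +0.5)

print(nearest, farthest)
```

The practical value of such a recurrence is that an agglomerative procedure can update its distance table after each merge without revisiting the original samples.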


33. Consider a hierarchical clustering procedure in which clusters are merged so as
to produce the smallest increase in the sum-of-squared error at each step. If the ith
cluster contains $n_i$ samples with sample mean $\mathbf{m}_i$, show that the smallest increase
results from merging the pair of clusters for which

$$\frac{n_i n_j}{n_i + n_j} \|\mathbf{m}_i - \mathbf{m}_j\|^2$$

is minimum.

34. Assume we are clustering using the sum-of-squared error criterion $J_e$ (Eq. ??).
Show that a “distance” measure between clusters can be derived, Eq. 78, such that
merging the “closest” such clusters increases $J_e$ as little as possible.

35. Create by hand a dendrogram for the following eight points in one dimension: 
{-5.5, -4.1, -3.0, -2.6, 10.1, 11.9, 12.3, 13.6}. Define the similarity between two
clusters to be $20 - d_{min}(\mathcal{D}_i, \mathcal{D}_j)$, where $d_{min}(\mathcal{D}_i, \mathcal{D}_j)$ is given in Eq. 74. Based on your
dendrogram, argue that two is the natural number of clusters. 

36. Create by hand a dendrogram for the following 10 points in one dimension: 
{-2.2, -2.0, -0.3, 0.1, 0.2, 0.4, 1.6, 1.7, 1.9, 2.0}. Define the similarity between two
clusters to be $20 - d_{min}(\mathcal{D}_i, \mathcal{D}_j)$, where $d_{min}(\mathcal{D}_i, \mathcal{D}_j)$ is given in Eq. 74. Based on your
dendrogram, argue that three is the natural number of clusters. 

37. Assume that the nearest-neighbor cluster algorithm has been allowed to continue 
fully, thereby giving a tree with a path from any node to any other node. Show that 
the sum of the edge lengths of this resulting tree will not exceed the sum of the edge 
lengths for any other spanning tree for that set of samples. 

38. Assume that a large number n of d-dimensional samples has been chosen from
a multidimensional Gaussian, i.e., $p(\mathbf{x}) \sim N(\mathbf{m}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is an arbitrary
positive-definite covariance matrix.

(a) Prove that the distribution of the criterion function $J_e(1)$ given in Eq. 82 is
normal with mean $nd\sigma^2$. Express $\sigma$ in terms of $\boldsymbol{\Sigma}$.


(b) Prove that the variance of this distribution is $2nd\sigma^4$.


(c) Consider a suboptimal partition of the Gaussian by a hyperplane through the
sample mean. Show that for large n, the sum of squared error for this partition
is approximately normal with mean $n(d - 2/\pi)\sigma^2$ and variance $2n(d - 8/\pi^2)\sigma^4$,
where $\sigma$ is given in part (a).


39. Derive Eqs. 85 & 86. 

40. Consider a simple greedy algorithm for creating a spanning tree.


(a) Write pseudocode for creating a minimal spanning tree linking n points in d
dimensions.


(b) Let k denote the average linkage per node. What is the average space complexity 
of your algorithm? 


(c) What is the average time complexity? 

41. Consider the adaptive resonance clustering algorithm.

(a) Show that the standard ART algorithm cannot learn the XOR problem. 


(b) Explain how the number of clusters generated by the adaptive resonance algo- 
rithm depends upon the order of presentation of the samples. 


(c) Discuss the benefits and drawbacks of adaptive resonance in stationary and in 
non-stationary environments. 

42. Show that minimizing a mean-squared error criterion for d-dimensional data
leads to the k-dimensional representation (k < d) of the Karhunen-Loève transform
(Eq. 90) as follows. For simplicity, assume that the data set has zero mean. (If the
mean is not zero, we can always subtract off the mean from each vector to define
new vectors.)


(a) The (scalar) projection of a vector $\mathbf{x}$ onto a unit vector $\mathbf{e}$, $a(\mathbf{e}) = \mathbf{x}^t \mathbf{e}$, is, of
course, a random variable. Define the variance of a to be $\sigma^2 = \mathcal{E}_{\mathbf{x}}[a^2]$. Show
that $\sigma^2 = \mathbf{e}^t \boldsymbol{\Sigma} \mathbf{e}$, where $\boldsymbol{\Sigma} = \mathcal{E}_{\mathbf{x}}[\mathbf{x}\mathbf{x}^t]$ is the correlation matrix.


(b) A vector $\mathbf{e}$ that yields an extremal or stationary value of this variance must obey
$\sigma^2(\mathbf{e} + \delta\mathbf{e}) = \sigma^2(\mathbf{e})$, where $\delta\mathbf{e}$ is a small perturbation. Show that this condition
implies $(\delta\mathbf{e})^t \boldsymbol{\Sigma} \mathbf{e} = 0$ at such a stationary point.


(c) Consider small variations $\delta\mathbf{e}$ that do not change the length of the vector, i.e.,
ones in which $\delta\mathbf{e}$ is perpendicular to $\mathbf{e}$. Use this condition and your above results
to show that $(\delta\mathbf{e})^t \boldsymbol{\Sigma} \mathbf{e} - \lambda (\delta\mathbf{e})^t \mathbf{e} = 0$, where $\lambda$ is a scalar. Show that the necessary
and sufficient solution is $\boldsymbol{\Sigma} \mathbf{e} = \lambda \mathbf{e}$, that is, the eigenvector equation of Eq. 99.




(d) Define a sum-squared-error criterion for a set of points in d-dimensional space 
and their projections onto a k-dimensional linear subspace (k < d). Use your 
results above to show that in order to minimize your criterion, the subspace 
should be spanned by the k eigenvectors of the correlation matrix having the
largest eigenvalues.


43. Show that a neural net auto-association network consisting of d — k — d input, 
hidden and output layer (with k < d) 


(a) Show that a neural net auto-association network consisting of d — k — d input, 
hidden and output layer (with k < d) and linear hidden units performs principal 
component analysis by considering the minimization it solves. Trained on sum 
squared error. 


(b) Show that a neural net auto-association network consisting of d — k — d input, 
hidden and output layer (with k < d) 


(c) Show that the five layer neural net auto-association network of Fig. ?? consisting 
of d— k — r — k — d where both layers having k units are nonlinear will perform 
nonlinear dimensionality reduction. 


44. Consider the use of neural networks for nonlinear principal component analysis.


(a) Prove that if all units in the five-layer network of Fig. 10.22 are linear, and the 
network trained to serve as an auto-encoder, then the representation learned at 
the middle layer corresponds to the linear principal component of the data. 


(b) State briefly why this also implies that a three-layer network (input, hidden, 
output) cannot be used for non-linear principal component analysis, even if the 
middle layer consists of non-linear units. 


45. The derivation of the Independent component analysis algorithm, summarized 
in Eq. 99, assumed that the sources and sum signals were all scalars, that there was 
no noise, and that the number of observations, T, is equal to the number of points 
generated by each source. 


(a) Relax all of these conditions to generalize the method to vectors, X1(t) +... + 
X¿(t). Assume, moreover, that the sum signal is corrupted by additive Gaussian 
noise of zero mean, but unknown covariance: p(y) ~ N(0, X). 


(b) Suppose the noise is sufficiently small (|X| < 1), and that the dimensionality 
of the vectors is set to d = 1. Show that your learning rule reduces to that of 
Eq. 99. 


46. Use the fact that the sum of samples from two Gaussians is again a Gaussian to
show why independent component analysis can not isolate sources perfectly if more 
than one has a Gaussian distribution. 

47. It is a fact that the Kullback-Leibler divergence is invariant under general
invertible transforms. Prove this for the special case of linear transforms, as used in 
Sect. 10.13. 

48. Consider the use of multidimensional scaling for representing the points $\mathbf{x}_1 =
(1, 0)^t$, $\mathbf{x}_2 = (0, 0)^t$ and $\mathbf{x}_3 = (0, 1)^t$ in one dimension. To obtain a unique solution,
assume that the image points satisfy $0 = y_1 < y_2 < y_3$.

(a) Show that the criterion function $J_{ee}$ is minimized by the configuration with
$y_2 = (1 + \sqrt{2})/3$ and $y_3 = 2y_2$.


(b) Show that the criterion function $J_{ff}$ is minimized by the configuration with
$y_2 = (2 + \sqrt{2})/4$ and $y_3 = 2y_2$.
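
The stated minima can be checked numerically before deriving them. The sketch below assumes the usual forms of the two criteria, with J_ee proportional to the sum of (d_ij − δ_ij)² and J_ff equal to the sum of ((d_ij − δ_ij)/δ_ij)², where δ_ij are the distances in the source space and d_ij those between image points. The function name `mds_1d`, the step size, and the starting configuration are ours; y_1 is held at 0 and the ordering y_1 < y_2 < y_3 is assumed to be preserved during descent.

```python
import math

def mds_1d(weights, steps=5000, lr=0.01):
    """Gradient descent on sum_ij w_ij (d_ij - delta_ij)^2 with y1 fixed at 0."""
    delta = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): math.sqrt(2.0)}  # source distances
    y = [0.0, 0.5, 1.5]                  # initial guess with y1 < y2 < y3
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (i, j), dlt in delta.items():
            diff = y[j] - y[i]           # equals d_ij while the ordering holds
            g = 2.0 * weights[(i, j)] * (diff - dlt)
            grad[j] += g
            grad[i] -= g
        y[1] -= lr * grad[1]             # y[0] is never updated
        y[2] -= lr * grad[2]
    return y

# J_ee weights every pair equally (its constant normalizer does not move the minimum).
y_ee = mds_1d({(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0})
# J_ff weights each pair by 1 / delta_ij^2.
y_ff = mds_1d({(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.5})
print(y_ee, y_ff)
```

In both cases the image of the middle point lands halfway between the other two; the criteria differ only in how much the long distance between the first and third points is allowed to shrink.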


Computer exercises 


Several exercises make use of the data in the following table. 


sample |    x1 |    x2 |    x3 | sample |    x1 |    x2 |    x3
     1 | -7.82 | -4.58 | -3.97 |     11 |  6.18 |  2.81 |  5.82
     2 | -6.68 |  3.16 |  2.71 |     12 |  6.72 | -0.93 | -4.04
     3 |  4.36 | -2.19 |  2.09 |     13 | -6.25 | -0.26 |  0.56
     4 |  6.72 |  0.88 |  2.80 |     14 | -6.94 | -1.22 |  1.13
     5 | -8.64 |  3.06 |  3.50 |     15 |  8.09 |  0.20 |  2.25
     6 | -6.87 |  0.57 | -5.45 |     16 |  6.81 |  0.17 | -4.15
     7 |  4.47 | -2.62 |  5.76 |     17 | -5.19 |  4.24 |  4.04
     8 |  6.73 | -2.01 |  4.18 |     18 | -6.38 | -1.74 |  1.43
     9 | -7.71 |  2.34 | -6.33 |     19 |  4.08 |  1.30 |  5.33
    10 | -6.91 | -0.49 | -5.68 |     20 |  6.27 |  0.93 | -2.78

1. Consider the univariate normal mixture

$$p(x|\boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_1}{\sigma_1}\right)^2\right] + \frac{1 - P(\omega_1)}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_2}{\sigma_2}\right)^2\right].$$

Write a general program for computing the maximum likelihood values of the
parameters, and apply it to the 20 $x_1$ points in the table above under the following
assumptions of what is known and what is unknown:


(a) Known: $P(\omega_1) = 0.5$, $\sigma_1 = \sigma_2 = 1$; Unknown: $\mu_1$ and $\mu_2$.

(b) Known: $P(\omega_1) = 0.5$; Unknown: $\sigma_1 = \sigma_2 = \sigma$, $\mu_1$ and $\mu_2$.

(c) Known: $P(\omega_1) = 0.5$; Unknown: $\sigma_1$, $\sigma_2$, $\mu_1$ and $\mu_2$.

(d) Unknown: $P(\omega_1)$, $\sigma_1$, $\sigma_2$, $\mu_1$ and $\mu_2$.


2. Write a program to implement k-means clustering (Algorithm 1), and apply it to 
the three-dimensional data in the table for the following assumed numbers of clusters, 
and starting points. 


(a) Let c = 2, $\mathbf{m}_1(0) = (1, 1, 1)^t$ and $\mathbf{m}_2(0) = (-1, 1, -1)^t$.


(b) Let c = 2, $\mathbf{m}_1(0) = (0, 0, 0)^t$ and $\mathbf{m}_2(0) = (1, 1, -1)^t$. Compare your final
solution with that from part (a), and explain any differences, including the
number of iterations for convergence.


(c) Let c = 3, $\mathbf{m}_1(0) = (0, 0, 0)^t$, $\mathbf{m}_2(0) = (1, 1, 1)^t$ and $\mathbf{m}_3(0) = (-1, 0, 2)^t$.

(d) Let c = 3, $\mathbf{m}_1(0) = (-0.1, 0, 0.1)^t$, $\mathbf{m}_2(0) = (0, -0.1, 0.1)^t$ and $\mathbf{m}_3(0) =
(-0.1, -0.1, 0.1)^t$. Compare your final solution with that from part (c), and
explain any differences, including the number of iterations for convergence.


3. Repeat Computer exercise 2, but use instead a fuzzy k-means algorithm
(Algorithm 2) with the “blending” parameter set to b = 2 (Eqs. 27 & 28).

4. Explore the problems that can come with mis-specifying the number of clusters in 
the fuzzy k-means algorithm (Algorithm 2) using the following one-dimensional data: 
D = {-5.0, -4.5, -4.1, -3.9, 2.5, 2.8, 3.1, 3.9, 4.5}.


(a) Use your program in the four conditions defined by c = 2 and c = 3, and b = 1
and b = 4. In each case initialize the cluster centers to distinct values, but ones
near x = 0.


(b) Compare your solutions to the c = 3, b = 4 case to the c = 3, b = 1 case, and 
discuss any sources of the differences. 


5. Show how a few labeled samples in a k-means algorithm can improve clustering 
of unlabeled samples in the following, somewhat extreme case. 


(a) Generate 50 two-dimensional samples for each of four spherical Gaussians, p(x|w;) ~ 
N(u,,1), where Hı = (e Hə = al H3 = (ke and Ha = (a): 


(b) Choose c = 4 initial positions for the cluster means randomly from the full 200 
samples. What is the probability that your random selection yields exactly one 
cluster center for each component density? (Make the simplifying assumption 
that the component densities do not overlap significantly.) 


(c) Using the four samples selected in part (b), run a k-means clusterer on the full
200 points. (If the four points in fact come from different components, re-select
samples to ensure that at least two come from the same component density
before using your clusterer.)


(d) Now assume you have some label information, in particular four samples known 
to come from distinct component densities. Using these as your initial cluster 
centers, run a k-means clusterer on the full 200 points. 


(e) Discuss the value of a few labeled samples for clustering in light of the final 
clusters given in (c & d). 

6. Explore unsupervised Bayesian learning of the mean of a Gaussian distribution
in the following way.


(a) Generate a data set $\mathcal{D}$ of 30 points, uniformly distributed in the interval
$-10 < x < +10$.


(b) Assume the data come from a normal distribution with known variance but
unknown mean, i.e., $p(x) \sim N(\mu, 2)$; that is, the unknown parameter $\theta$ in
Eq. 37 is simply the scalar $\mu$. Assume a wide prior for the parameter: $p(\mu)$
is uniform in the range $-10 < \mu < +10$. Plot posterior probabilities for
$k = 0, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30$ points from $\mathcal{D}$.


(c) Now assume instead a narrow prior, i.e., $p(\mu)$ uniform in the range $-1 < \mu < +1$,
and repeat part (b) using the same order of data presentation.


(d) Are your curves for part (b) and part (c) the same for a small number of points?
For a large number of points? Explain.


7. Write a decision-directed clusterer related to k-means in the following way. 


(a) First, generate a set $\mathcal{D}$ of $n = 1000$ three-dimensional points in the unit cube,
$0 \le x_i \le 1$, $i = 1, 2, 3$.


(b) Randomly choose $c = 4$ of these points as the initial cluster centers $\mathbf{m}_j$,
$j = 1, 2, 3, 4$.


(c) The core of the algorithm operates as follows: First, each sample $\mathbf{x}_i$ is classified
by the nearest cluster center $\mathbf{m}_j$. Next, each mean $\mathbf{m}_j$ is recalculated as
the mean of the samples assigned to $\omega_j$. If there is no change in the centers after $n$
presentations, halt.


(d) Use your algorithm to plot four trajectories of the position of the cluster centers. 


(e) What are the space and time complexities of this algorithm? State any
assumptions you invoke.

8. Explore the role of metrics, similarity measures and thresholds on cluster formation
in the following way. 


(a) First, generate a two-dimensional data set consisting of two parts: $\mathcal{D}_1$ contains
100 points whose distance from the origin is chosen uniformly in the range
$3 < r < 5$, and angular position uniform in the range $0 < \phi < 2\pi$; likewise, $\mathcal{D}_2$
consists of 50 points of distance $0 < r < 2$ and angle $0 < \phi < 2\pi$. The full data
set used below is $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$.


(b) Write a simple clustering algorithm that links any two points $\mathbf{x}$ and $\mathbf{x}'$ if
$d(\mathbf{x}, \mathbf{x}') < \theta$, where $\theta$ is a threshold selected by the user, and distance is
calculated by means of a general Minkowski metric (Eq. 44),

$$ d(\mathbf{x}, \mathbf{x}') = \Big( \sum_{k=1}^{d} |x_k - x'_k|^q \Big)^{1/q}. $$


Let $q = 2$ (Euclidean distance) and apply your algorithm to the data $\mathcal{D}$ for
the following thresholds: $\theta = 0.01, 0.05, 0.1, 0.5, 1, 5$. In each case, plot all 150
points and differentiate the clusters by color or other plotting convention.


(c) Repeat part (b) with q = 1 (city block distance). 
(d) Repeat part (b) with q = 4. 


(e) Discuss how the metric affects the “natural” number of clusters implied by your 
results. 

9. Explore different clustering criteria by exhaustive search in the following way. Let
D be the first seven three-dimensional points in the table above. 


(a) If we assume that any cluster must have at least one point, how many cluster 
configurations are possible for the seven points? 


(b) Write a program to search through each of the cluster configurations, and for
each compute the following criteria: $J_e$ (Eq. 49), $J_d$ (Eq. 63), $\sum_{i=1}^{d} \lambda_i$ (Eq. 64),
$J_f = \mathrm{tr}\,\mathbf{S}_T^{-1}\mathbf{S}_W$ (Eq. 65) and $|\mathbf{S}_W|/|\mathbf{S}_T|$ (Eq. 66). Show the optimal clusters for
each of your criteria.


(c) Perform a whitening transformation on your points and repeat part (b).


(d) In light of your results, discuss which of the criteria are invariant to the whitening
transformation.

10. Show that the Basic Iterative Least-Squares clustering algorithm gives solutions
and final criterion values that depend upon starting conditions in the following way.
Implement Algorithm 3 for $c = 3$ clusters and apply it to the data in the table above.
For each simulation, list the final clusters as sets of points (identified by their index
in the table), along with the corresponding value of the criterion function.


(a) $\mathbf{m}_1(0) = (1, 1, 1)^t$, $\mathbf{m}_2(0) = (-1, -1, -1)^t$ and $\mathbf{m}_3(0) = (0, 0, 0)^t$.
(b) $\mathbf{m}_1(0) = (0.1, 0.1, 0.1)^t$, $\mathbf{m}_2(0) = (-0.1, -0.1, -0.1)^t$ and $\mathbf{m}_3(0) = (0, 0, 0)^t$.
(c) $\mathbf{m}_1(0) = (2, 0, 2)^t$, $\mathbf{m}_2(0) = (-2, 0, -2)^t$ and $\mathbf{m}_3(0) = (1, 1, 1)^t$.
(d) $\mathbf{m}_1(0) = (0.5, 1, 0.2)^t$, $\mathbf{m}_2(0) = (0.2, -1, 0.5)^t$ and $\mathbf{m}_3(0) = (0.2, 0.4, 0.6)^t$.
(e) Explain why your final answers differ.

11. Implement the basic hierarchical agglomerative clustering algorithm (Algorithm 4),
as well as a method for drawing dendrograms based on its results. Apply
your algorithm and draw dendrograms for the data in the table above using the
distance measures indicated below. Define the similarity between two clusters to be linear
in distance, with similarity = 100 for singleton clusters ($c = 20$) and similarity = 0
for the single cluster ($c = 1$).


(a) $d_{\min}$ (Eq. 74)
(b) $d_{\max}$ (Eq. 75)
(c) $d_{avg}$ (Eq. 76)
(d) $d_{mean}$ (Eq. 77)

12. Explore the use of cluster dendrograms for selecting the “most natural” number
of clusters.


(a) Write a program to perform hierarchical clustering and display a dendrogram,
using a measure of distance selected from Eqs. 74 to 77.


(b) Write a program to generate $n/c$ points from each of $c$ one-dimensional Gaussians,
$p(x|\omega_i) \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, c$. Use your program to generate $n = 50$
points, 25 in each of two clusters, with $\mu_1 = 0$, $\mu_2 = 1$, and $\sigma_1^2 = \sigma_2^2 = 1$.
Repeat with $\mu_2 = 4$.


(c) Use your program from (a) to generate dendrograms for each of the two data
sets generated in (b).


(d) The difference in similarity values for successive levels is a random variable, 
which we can model as a normal distribution with mean and variance. Suppose 
we define the “most natural” number of clusters according to the largest gap in 
similarity values, and that this largest gap is significant if it differs “significantly” 
from the distribution. State your criterion analytically, and show that one of 
the cases in (??) indeed has two clusters. 

13. xxx

14. Implement a basic competitive learning clustering algorithm (Algorithm 6) and
apply it to the three-dimensional data in the table above as follows. 


(a) First, preprocess the data by augmenting each vector with $x_0 = 1$ and normalizing
to unit length. In this way, each point lies on the surface of a hypersphere.


(b) Set $c = 2$, and let the initial (normalized) weight vectors correspond to patterns
1 & 2. Let the learning rate be $\eta = 0.1$. Present the patterns in cyclic order,
i.e., 1, 2, ..., 20, 1, 2, ..., 20, ....


(c) Modify your program so as to reduce the learning rate by multiplying by the
constant factor $\alpha < 1$ after each pattern presentation, so the learning rate
approaches zero exponentially. Repeat your simulation of part (b) with such
decay, where $\alpha = 0.99$. Compare your final clusterings with those from using
$\alpha = 0.5$.


(d) Repeat part (c) but with the patterns chosen in a random order, i.e., with the 
probability of presenting any given pattern being 1/20 per trial. Discuss the 
role of random versus sequenced pattern presentation on the final clusterings. 

15. PCA exercise

16. Explore the use of independent component analysis for blind source separation
in the following example.

(a) Generate 100 points for $t = 1, \ldots, 100$ for $x_1(t)$ = xxx and $x_2(t)$ = xxx.
Generate 100 points each for three sensors according to:


$x_1(t)$ = xxx
$x_2(t)$ = xxx

and three sensors:


$s_1(t)$ = xxx
$s_2(t)$ = xxx
$s_3(t)$ = xxx


(Of course, in this blind source separation task, neither the source signals nor 
the mixing parameters are known.) 


(b) xxx

17. Repeat Computer exercise 16, but for three sources:


$x_1(t)$ = xxx
$x_2(t)$ = xxx
$x_3(t)$ = xxx


and four sensors:


$s_1(t)$ = xxx
$s_2(t)$ = xxx
$s_3(t)$ = xxx
$s_4(t)$ = xxx

18. Write a computer program that uses the general maximum-likelihood equation of
Sect. ?? iteratively to estimate the unknown means, variances, and prior probabilities. 
Use this program to find maximum-likelihood estimates of these parameters for the 
data in Table ??. 

19. hill climbing for clustering. Start at BAD and at GOOD starting places. Note 
that do not get same answer. 

20. Write a program to perform the minimization of in Problem 28. 

21. what if you have the wrong number of clusters?....xxx 

Appendix A


Mathematical foundations 


Our goal here is to present the basic results and definitions from linear algebra,
probability theory, information theory and computational complexity that serve
as the mathematical foundations for pattern recognition. We will try to give intuitive
insight whenever appropriate, but do not attempt to prove these results; systematic
expositions can be found in the references.


A.1 Notation 


Here are the terms and notation used throughout the book. In addition, there are 
numerous specialized variables and functions whose definitions and usage should be 
clear from the text. 


variables, symbols and operations


$\approx$                     approximately equal to
$\equiv$                      equivalent to (or defined to be)
$\propto$                     proportional to
$\infty$                      infinity
$x \to a$                     $x$ approaches $a$
$t \leftarrow t + 1$          in an algorithm: assign to variable $t$ the new value $t + 1$
$\lim_{x \to a} f(x)$         the value of $f(x)$ in the limit as $x$ approaches $a$
$\arg\max_x f(x)$             the value of $x$ that leads to the maximum value of $f(x)$
$\arg\min_x f(x)$             the value of $x$ that leads to the minimum value of $f(x)$
$\lceil x \rceil$             ceiling of $x$, i.e., the least integer not smaller than $x$ (e.g., $\lceil 3.5 \rceil = 4$)
$\lfloor x \rfloor$           floor of $x$, i.e., the greatest integer not larger than $x$ (e.g., $\lfloor 3.5 \rfloor = 3$)
$m \bmod n$                   $m$ modulo $n$, the remainder when $m$ is divided by $n$ (e.g., $7 \bmod 5 = 2$)
$\ln(x)$                      logarithm base $e$, or natural logarithm of $x$
$\log(x)$                     logarithm base 10 of $x$
$\log_2(x)$                   logarithm base 2 of $x$
$\exp[x]$ or $e^x$            exponential of $x$, i.e., $e$ raised to the power of $x$
$\partial f(x)/\partial x$    partial derivative of $f$ with respect to $x$
$\int_a^b f(x)\,dx$           the integral of $f(x)$ between $a$ and $b$. If no limits are written, the
                              full space is assumed
$F(x; \theta)$                function of $x$, with implied dependence upon $\theta$
$\blacksquare$                Q.E.D., quod erat demonstrandum (“which was to be proved”),
                              used to signal the end of a proof


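mathematical operations


$\langle x \rangle$           expected value of random variable $x$
$\bar{x}$                     mean or average value of $x$
$\mathcal{E}[f(x)]$           the expected value of function $f(x)$ where $x$ is a random variable
$\mathcal{E}_y[f(x, y)]$      the expected value of function over several variables, $f(x, y)$, taken
                              over a subset $y$ of them
$\mathrm{Var}_f[\cdot]$       the variance, i.e., $\mathcal{E}_f[(x - \mathcal{E}_f[x])^2]$
$\sum_{i=1}^{n} a_i$          the sum from $i = 1$ to $n$: $a_1 + a_2 + \ldots + a_n$
$\prod_{i=1}^{n} a_i$         the product from $i = 1$ to $n$: $a_1 \times a_2 \times \ldots \times a_n$
$f(x) \star g(x)$             convolution of $f(x)$ with $g(x)$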


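vectors and matrices


$\mathbb{R}^d$                          d-dimensional Euclidean space
$\mathbf{x}, \mathbf{A}, \ldots$        boldface is used for (column) vectors and matrices
$\mathbf{f}(x)$                         vector-valued function (note the boldface) of a scalar
$\mathbf{f}(\mathbf{x})$                vector-valued function (note the boldface) of a vector
$\mathbf{I}$                            identity matrix, square matrix having 1s on the diagonal and 0
                                        everywhere else
$\mathbf{1}_i$                          vector of length $i$ consisting solely of 1's
$\mathrm{diag}(a_1, a_2, \ldots, a_d)$  matrix whose diagonal elements are $a_1, a_2, \ldots, a_d$, and
                                        off-diagonal elements 0
$\mathbf{x}^t$                          transpose of vector $\mathbf{x}$
$\|\mathbf{x}\|$                        Euclidean norm of vector $\mathbf{x}$
$\boldsymbol{\Sigma}$                   covariance matrix
$\mathrm{tr}[\mathbf{A}]$               the trace of $\mathbf{A}$, i.e., the sum of its diagonal components:
                                        $\mathrm{tr}[\mathbf{A}] = \sum_{i=1}^{d} a_{ii}$
$\mathbf{A}^{-1}$                       the inverse of matrix $\mathbf{A}$
$\mathbf{A}^{\dagger}$                  pseudoinverse of matrix $\mathbf{A}$
$|\mathbf{A}|$                          determinant of $\mathbf{A}$
$\lambda$                               eigenvalue
$\mathbf{e}$                            eigenvector
$\mathbf{u}_i$                          unit vector in the $i$th direction in Euclidean space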
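

sets


$\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D}, \ldots$    “Calligraphic” font generally denotes sets or lists, e.g., data set
                              $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$
$\mathbf{x} \in \mathcal{D}$          $\mathbf{x}$ is an element of set $\mathcal{D}$
$\mathbf{x} \notin \mathcal{D}$       $\mathbf{x}$ is not an element of set $\mathcal{D}$
$\mathcal{A} \cup \mathcal{B}$        union of two sets, i.e., the set containing all elements of $\mathcal{A}$ and $\mathcal{B}$
$|\mathcal{D}|$               the cardinality of set $\mathcal{D}$, i.e., the number of (possibly non-distinct)
                              elements in it; occasionally written $\mathrm{card}\,\mathcal{D}$
$\max_x[\mathcal{D}]$         the maximum $x$ value in set $\mathcal{D}$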


probability, distributions and complexity


$\omega$                      state of nature
$P(\cdot)$                    probability
$p(\cdot)$                    probability density
$P(a, b)$                     the joint probability, i.e., the probability of having both $a$ and $b$
$p(a, b)$                     the joint probability density, i.e., the probability density of having
                              both $a$ and $b$
$\Pr\{\cdot\}$                the probability of a condition being met, e.g., $\Pr\{x < x_0\}$ means
                              the probability that $x$ is less than $x_0$
$p(\mathbf{x}|\boldsymbol{\theta})$    the conditional probability density of $\mathbf{x}$ given $\boldsymbol{\theta}$
$\mathbf{w}$                  weight vector
$\lambda(\cdot, \cdot)$       loss function
$\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_d)^t$    gradient operator in $\mathbb{R}^d$, sometimes written grad
$\nabla_{\boldsymbol{\theta}} = (\partial/\partial\theta_1, \ldots, \partial/\partial\theta_d)^t$    gradient operator in $\boldsymbol{\theta}$ coordinates, sometimes written $\mathrm{grad}_{\boldsymbol{\theta}}$
$\hat{\boldsymbol{\theta}}$   maximum likelihood value of $\boldsymbol{\theta}$
$\sim$                        “has the distribution,” e.g., $p(x) \sim N(\mu, \sigma^2)$ means that the density
                              of $x$ is normal, with mean $\mu$ and variance $\sigma^2$
$N(\mu, \sigma^2)$            normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$
$N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$    multidimensional normal or Gaussian distribution with mean $\boldsymbol{\mu}$
                              and covariance matrix $\boldsymbol{\Sigma}$
$U(x_l, x_u)$                 a one-dimensional uniform distribution between $x_l$ and $x_u$
$U(\mathbf{x}_l, \mathbf{x}_u)$       a d-dimensional uniform density, i.e., uniform density within the
                              smallest axes-aligned bounding box that contains both $\mathbf{x}_l$ and $\mathbf{x}_u$,
                              and zero elsewhere
$T(\mu, \delta)$              triangle distribution, having center $\mu$ and full half-width $\delta$
$\delta(x)$                   Dirac delta function
$\Gamma(\cdot)$               Gamma function
$n!$                          $n$ factorial $= n \times (n - 1) \times (n - 2) \times \ldots \times 1$
$\binom{n}{k} = \frac{n!}{k!(n-k)!}$    binomial coefficient, $n$ choose $k$ for $n$ and $k$ integers
$O(h(x))$                     big oh order of $h(x)$
$\Theta(h(x))$                big theta order of $h(x)$
$\Omega(h(x))$                big omega order of $h(x)$
$\sup_x f(x)$                 the supremum value of $f(x)$, i.e., the global maximum of $f(x)$ over
                              all values of $x$


A.2 Linear algebra 


A.2.1 Notation and preliminaries 


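A d-dimensional column vector $\mathbf{x}$ and its transpose $\mathbf{x}^t$ can be written as

$$ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \quad \text{and} \quad \mathbf{x}^t = (x_1 \;\; x_2 \;\; \ldots \;\; x_d), \tag{1} $$

where all components can take on real values. We denote an $n \times d$ (rectangular)
matrix $\mathbf{M}$ and its $d \times n$ transpose $\mathbf{M}^t$ as

$$ \mathbf{M} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & \ldots & m_{1d} \\ m_{21} & m_{22} & m_{23} & \ldots & m_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m_{n1} & m_{n2} & m_{n3} & \ldots & m_{nd} \end{pmatrix} \quad \text{and} \tag{2} $$

$$ \mathbf{M}^t = \begin{pmatrix} m_{11} & m_{21} & \ldots & m_{n1} \\ m_{12} & m_{22} & \ldots & m_{n2} \\ m_{13} & m_{23} & \ldots & m_{n3} \\ \vdots & \vdots & \ddots & \vdots \\ m_{1d} & m_{2d} & \ldots & m_{nd} \end{pmatrix}. \tag{3} $$

In other words, the $ji$th entry of $\mathbf{M}^t$ is the $ij$th entry of $\mathbf{M}$.

A square ($d \times d$) matrix is called symmetric if its entries obey $m_{ij} = m_{ji}$; it is
called skew-symmetric (or anti-symmetric) if $m_{ij} = -m_{ji}$. A general matrix is called
non-negative if $m_{ij} \ge 0$ for all $i$ and $j$. A particularly important matrix is the identity
matrix, $\mathbf{I}$, a $d \times d$ (square) matrix whose diagonal entries are 1's, and all other entries
0. The Kronecker delta function or Kronecker symbol, defined as

$$ \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise,} \end{cases} \tag{4} $$

can serve to define the entries of an identity matrix. A general diagonal matrix (i.e.,
one having 0 for all off-diagonal entries) is denoted $\mathrm{diag}(m_{11}, m_{22}, \ldots, m_{dd})$, the entries
being the successive elements $m_{11}, m_{22}, \ldots, m_{dd}$. Addition of vectors and of matrices
is component by component.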

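We can multiply a vector by a matrix, $\mathbf{M}\mathbf{x} = \mathbf{y}$, i.e.,

$$ \begin{pmatrix} m_{11} & m_{12} & \ldots & m_{1d} \\ m_{21} & m_{22} & \ldots & m_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n1} & m_{n2} & \ldots & m_{nd} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \tag{5} $$

where

$$ y_j = \sum_{i=1}^{d} m_{ji} x_i. \tag{6} $$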


Note that the number of columns of M must equal the number of rows of x. Also, if 
M is not square, the dimensionality of y differs from that of x. 
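
As a small illustration of the component formula for a matrix-vector product, the following sketch multiplies a 2 x 3 matrix by a 3-dimensional vector. The helper name `mat_vec` and the numerical values are assumptions chosen for this example only.

```python
def mat_vec(M, x):
    """Return y = Mx, where y_j is the sum over i of m_ji * x_i."""
    if any(len(row) != len(x) for row in M):
        raise ValueError("number of columns of M must equal the length of x")
    return [sum(m_ji * x_i for m_ji, x_i in zip(row, x)) for row in M]

# A 2 x 3 matrix maps a 3-dimensional x to a 2-dimensional y.
M = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
y = mat_vec(M, x)
print(y)
```

Because M is not square here, the result has a different dimensionality (2) from the input (3).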


A.2.2 Inner product 


The inner product of two vectors having the same dimensionality will be denoted here
as $\mathbf{x}^t\mathbf{y}$ and yields a scalar:

$$ \mathbf{x}^t\mathbf{y} = \sum_{i=1}^{d} x_i y_i = \mathbf{y}^t\mathbf{x}. \tag{7} $$

It is sometimes also called the scalar product or dot product and denoted $\mathbf{x} \cdot \mathbf{y}$, or
more rarely $(\mathbf{x}, \mathbf{y})$. The Euclidean norm or length of the vector is

$$ \|\mathbf{x}\| = \sqrt{\mathbf{x}^t\mathbf{x}}; \tag{8} $$

we call a vector “normalized” if $\|\mathbf{x}\| = 1$. The angle $\theta$ between two d-dimensional
vectors obeys

$$ \cos\theta = \frac{\mathbf{x}^t\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}, \tag{9} $$

and thus the inner product is a measure of the colinearity of two vectors, a natural
indication of their similarity. In particular, if $\mathbf{x}^t\mathbf{y} = 0$, then the vectors are orthogonal,
and if $|\mathbf{x}^t\mathbf{y}| = \|\mathbf{x}\|\,\|\mathbf{y}\|$, the vectors are colinear. From Eq. 9, since $|\cos\theta| \le 1$, we have
immediately the Cauchy-Schwarz inequality, which states

$$ |\mathbf{x}^t\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|. \tag{10} $$

We say a set of vectors $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is linearly independent if no vector in the
set can be written as a linear combination of the others. Informally, a set of d
linearly independent vectors spans a d-dimensional vector space, i.e., any vector in
that space can be written as a linear combination of such spanning vectors.
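
The inner product, the Euclidean norm and the angle relation can be checked numerically. The sketch below uses only the Python standard library; the function names and the three example vectors are assumptions made for illustration.

```python
import math

def inner(x, y):
    """Inner product x^t y of two vectors of equal dimensionality."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean norm ||x|| = sqrt(x^t x)."""
    return math.sqrt(inner(x, x))

def cos_angle(x, y):
    """Cosine of the angle between x and y."""
    return inner(x, y) / (norm(x) * norm(y))

x = [3.0, 4.0]
y = [-4.0, 3.0]   # orthogonal to x
z = [6.0, 8.0]    # colinear with x

print(inner(x, y))       # orthogonal vectors have zero inner product
print(cos_angle(x, z))   # colinear vectors have |cos(theta)| = 1
# Cauchy-Schwarz: |x^t y| <= ||x|| ||y|| (small slack added for rounding)
print(abs(inner(x, z)) <= norm(x) * norm(z) + 1e-12)
```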


A.2.3 Outer product 


The outer product (sometimes called matrix product or dyadic product) of two vectors
yields a matrix

$$ \mathbf{M} = \mathbf{x}\mathbf{y}^t = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} (y_1 \;\; y_2 \;\; \ldots \;\; y_n) = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \ldots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \ldots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_d y_1 & x_d y_2 & \ldots & x_d y_n \end{pmatrix}, \tag{11} $$

that is, the components of $\mathbf{M}$ are $m_{ij} = x_i y_j$. Of course, if the dimensions of $\mathbf{x}$ and
$\mathbf{y}$ are not the same, then $\mathbf{M}$ is not square.
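
A minimal sketch of the outer product follows, assuming vectors stored as Python lists; the name `outer` and the sample values are illustrative only.

```python
def outer(x, y):
    """Outer product M = x y^t, with components m_ij = x_i * y_j."""
    return [[x_i * y_j for y_j in y] for x_i in x]

x = [1, 2, 3]      # d = 3
y = [10, 20]       # n = 2
M = outer(x, y)    # a 3 x 2 matrix, not square because d != n
for row in M:
    print(row)
```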


A.2.4 Derivatives of matrices 


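Suppose $f(\mathbf{x})$ is a scalar-valued function of d variables $x_i$, $i = 1, 2, \ldots, d$, which we
represent as the vector $\mathbf{x}$. Then the derivative or gradient of $f$ with respect to this
vector is computed component by component, i.e.,

$$ \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_d} \end{pmatrix}. \tag{12} $$

If we have an n-dimensional vector-valued function $\mathbf{f}$ (note the use of boldface),
of a d-dimensional vector $\mathbf{x}$, we calculate the derivatives and represent them as the
Jacobian matrix

$$ \mathbf{J}(\mathbf{x}) = \frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \ldots & \frac{\partial f_1(\mathbf{x})}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n(\mathbf{x})}{\partial x_1} & \ldots & \frac{\partial f_n(\mathbf{x})}{\partial x_d} \end{pmatrix}. \tag{13} $$

If this matrix is square, its determinant (Sect. A.2.5) is called simply the Jacobian or
occasionally the Jacobian determinant.

If the entries of $\mathbf{M}$ depend upon a scalar parameter $\theta$, we can take the derivative
of $\mathbf{M}$ component by component, to get another matrix, as

$$ \frac{\partial \mathbf{M}}{\partial \theta} = \begin{pmatrix} \frac{\partial m_{11}}{\partial \theta} & \frac{\partial m_{12}}{\partial \theta} & \ldots & \frac{\partial m_{1d}}{\partial \theta} \\ \frac{\partial m_{21}}{\partial \theta} & \frac{\partial m_{22}}{\partial \theta} & \ldots & \frac{\partial m_{2d}}{\partial \theta} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial m_{n1}}{\partial \theta} & \frac{\partial m_{n2}}{\partial \theta} & \ldots & \frac{\partial m_{nd}}{\partial \theta} \end{pmatrix}. \tag{14} $$

In Sect. A.2.6 we shall discuss matrix inversion, but for convenience we give here the
derivative of the inverse of a matrix, $\mathbf{M}^{-1}$:

$$ \frac{\partial}{\partial \theta} \mathbf{M}^{-1} = -\mathbf{M}^{-1} \frac{\partial \mathbf{M}}{\partial \theta} \mathbf{M}^{-1}. \tag{15} $$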
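
Consider a matrix $\mathbf{M}$ that is independent of $\mathbf{x}$. The following vector derivative
identities can be verified by writing out the components:

$$ \frac{\partial}{\partial \mathbf{x}} [\mathbf{M}\mathbf{x}] = \mathbf{M} \tag{16} $$
$$ \frac{\partial}{\partial \mathbf{x}} [\mathbf{y}^t\mathbf{x}] = \frac{\partial}{\partial \mathbf{x}} [\mathbf{x}^t\mathbf{y}] = \mathbf{y} \tag{17} $$
$$ \frac{\partial}{\partial \mathbf{x}} [\mathbf{x}^t\mathbf{M}\mathbf{x}] = [\mathbf{M} + \mathbf{M}^t]\mathbf{x}. \tag{18} $$

In the case where $\mathbf{M}$ is symmetric (as for instance a covariance matrix, cf. Sect. A.4.10),
then Eq. 18 simplifies to

$$ \frac{\partial}{\partial \mathbf{x}} [\mathbf{x}^t\mathbf{M}\mathbf{x}] = 2\mathbf{M}\mathbf{x}. \tag{19} $$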
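
The identity for the gradient of the quadratic form $\mathbf{x}^t\mathbf{M}\mathbf{x}$ can be checked numerically by comparing $[\mathbf{M} + \mathbf{M}^t]\mathbf{x}$ with a finite-difference estimate. This is only a sanity check under assumed values: the matrix, the point and the step size `h` are arbitrary choices, and the function names are hypothetical.

```python
def quad_form(M, x):
    """f(x) = x^t M x."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_grad(M, x):
    """[M + M^t] x, the claimed gradient of x^t M x."""
    n = len(x)
    return [sum((M[i][j] + M[j][i]) * x[j] for j in range(n)) for i in range(n)]

def numeric_grad(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

M = [[2.0, 1.0],
     [-3.0, 4.0]]   # deliberately not symmetric
x = [1.0, 2.0]

g_exact = analytic_grad(M, x)
g_num = numeric_grad(lambda v: quad_form(M, v), x)
print(g_exact)
print(g_num)
```

The two printed vectors should agree up to rounding error; for a symmetric M the analytic result reduces to 2Mx.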


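We first recall the use of second derivatives of a scalar function of a scalar $x$ in
writing a Taylor series (or Taylor expansion) about a point:

$$ f(x) = f(x_0) + \left. \frac{d f(x)}{dx} \right|_{x = x_0} (x - x_0) + \frac{1}{2!} \left. \frac{d^2 f(x)}{dx^2} \right|_{x = x_0} (x - x_0)^2 + O((x - x_0)^3). \tag{20} $$

Analogously, if our scalar-valued $f$ is instead a function of a vector $\mathbf{x}$, we can expand
$f(\mathbf{x})$ in a Taylor series around a point $\mathbf{x}_0$:

$$ f(\mathbf{x}) = f(\mathbf{x}_0) + \left[ \frac{\partial f}{\partial \mathbf{x}} \right]^t_{\mathbf{x} = \mathbf{x}_0} (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2!} (\mathbf{x} - \mathbf{x}_0)^t \underbrace{\left[ \frac{\partial^2 f}{\partial \mathbf{x}^2} \right]_{\mathbf{x} = \mathbf{x}_0}}_{\mathbf{H}} (\mathbf{x} - \mathbf{x}_0) + O(\|\mathbf{x} - \mathbf{x}_0\|^3), \tag{21} $$

where $\mathbf{H}$ is the Hessian matrix, the matrix of second-order derivatives of $f(\cdot)$, here
evaluated at $\mathbf{x}_0$. (We shall return in Sect. A.8 to consider the $O(\cdot)$ notation and the
order of a function used in Eq. 21 and below.)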


A.2.5 Determinant and trace 


The determinant of a $d \times d$ (square) matrix is a scalar, denoted $|\mathbf{M}|$, and reveals
properties of the matrix. For instance, if we consider the columns of $\mathbf{M}$ as vectors and
these vectors are not linearly independent, then the determinant vanishes. In pattern
recognition, we have particular interest in the covariance matrix $\boldsymbol{\Sigma}$, which contains
the second moments of a sample of data. In this case the absolute value of the
determinant of a covariance matrix is a measure of the d-dimensional hypervolume
of the data that yielded $\boldsymbol{\Sigma}$. (It can be shown that the determinant is equal to the
product of the eigenvalues of a matrix, as mentioned in Sec. A.2.7.) If the data
lie in a subspace of the full d-dimensional space, then the columns of $\boldsymbol{\Sigma}$ are not
linearly independent, and the determinant vanishes. Further, the determinant must
be non-zero for the inverse of a matrix to exist (Sec. A.2.6).

The calculation of the determinant is simple in low dimensions, and a bit more
involved in high dimensions. If $\mathbf{M}$ is itself a scalar (i.e., a $1 \times 1$ matrix $M$), then
$|M| = M$. If $\mathbf{M}$ is $2 \times 2$, then $|\mathbf{M}| = m_{11}m_{22} - m_{21}m_{12}$. The determinant of a general
square matrix can be computed by a method called expansion by minors, and this
leads to a recursive definition. If $\mathbf{M}$ is our $d \times d$ matrix, we define $\mathbf{M}_{i|j}$ to be the
$(d - 1) \times (d - 1)$ matrix obtained by deleting the $i$th row and the $j$th column of $\mathbf{M}$:

$$ \mathbf{M}_{i|j} = \big[\, m_{kl} \,\big]_{k \ne i,\; l \ne j}. \tag{22} $$

Given the determinants $|\mathbf{M}_{i|1}|$, we can now compute the determinant of $\mathbf{M}$ by
expansion by minors on the first column, giving

$$ |\mathbf{M}| = m_{11}|\mathbf{M}_{1|1}| - m_{21}|\mathbf{M}_{2|1}| + m_{31}|\mathbf{M}_{3|1}| - \cdots \pm m_{d1}|\mathbf{M}_{d|1}|, \tag{23} $$

where the signs alternate. This process can be applied recursively to the successive
(smaller) matrices in Eq. 23.

For a $3 \times 3$ matrix only, this determinant calculation can be represented by “sweeping”
the matrix, i.e., taking the sum of the products of matrix terms along a diagonal,
where products from upper-left to lower-right are added with a positive sign, and those
from the lower-left to upper-right with a minus sign. That is,

$$ |\mathbf{M}| = \begin{vmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{vmatrix} = m_{11}m_{22}m_{33} + m_{13}m_{21}m_{32} + m_{12}m_{23}m_{31} - m_{13}m_{22}m_{31} - m_{11}m_{23}m_{32} - m_{12}m_{21}m_{33}. \tag{24} $$

Again, this “sweeping” mnemonic does not work for matrices larger than $3 \times 3$.

For any matrix we have $|\mathbf{M}| = |\mathbf{M}^t|$. Furthermore, for two square matrices of
equal size $\mathbf{M}$ and $\mathbf{N}$, we have $|\mathbf{M}\mathbf{N}| = |\mathbf{M}|\,|\mathbf{N}|$.
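
The recursive expansion by minors on the first column translates directly into code. The sketch below is an illustration rather than an efficient method (its cost grows factorially with d); the names `minor`, `det` and `matmul` and the two integer matrices are assumptions for this example.

```python
def minor(M, i, j):
    """M with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i]

def det(M):
    """Determinant by expansion by minors on the first column."""
    d = len(M)
    if d == 1:
        return M[0][0]
    # Signs alternate down the first column: +, -, +, ...
    return sum((-1) ** i * M[i][0] * det(minor(M, i, 0)) for i in range(d))

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

M = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]
N = [[1, 2, 0],
     [0, 1, 1],
     [3, 0, 1]]
M_t = [list(col) for col in zip(*M)]   # transpose of M

print(det(M), det(N))
print(det(M_t) == det(M))                       # |M| = |M^t|
print(det(matmul(M, N)) == det(M) * det(N))     # |MN| = |M||N|
```

For the 3 x 3 matrix M above, the result can also be confirmed by hand with the sweeping mnemonic.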

The trace of a $d \times d$ (square) matrix, denoted $\mathrm{tr}[\mathbf{M}]$, is the sum of its diagonal
elements:

$$ \mathrm{tr}[\mathbf{M}] = \sum_{i=1}^{d} m_{ii}. \tag{25} $$

Both the determinant and trace of a matrix are invariant with respect to rotations of
the coordinate system.


A.2.6 Matrix inversion


So long as its determinant does not vanish, the inverse of a $d \times d$ matrix $\mathbf{M}$, denoted
$\mathbf{M}^{-1}$, is the $d \times d$ matrix such that

$$ \mathbf{M}\mathbf{M}^{-1} = \mathbf{I}. \tag{26} $$

We call the scalar $C_{ij} = (-1)^{i+j}|\mathbf{M}_{i|j}|$ the $i,j$ cofactor or equivalently the cofactor of
the $i,j$ entry of $\mathbf{M}$. As defined in Eq. 22, $\mathbf{M}_{i|j}$ is the $(d - 1) \times (d - 1)$ matrix formed
by deleting the $i$th row and $j$th column of $\mathbf{M}$. The adjoint of $\mathbf{M}$, written $\mathrm{Adj}[\mathbf{M}]$, is
the matrix whose $i,j$ entry is the $j,i$ cofactor of $\mathbf{M}$. Given these definitions, we can
write the inverse of a matrix as

$$ \mathbf{M}^{-1} = \frac{\mathrm{Adj}[\mathbf{M}]}{|\mathbf{M}|}. \tag{27} $$

If $\mathbf{M}$ is not square (or if $\mathbf{M}^{-1}$ in Eq. 27 does not exist because the columns of $\mathbf{M}$ are
not linearly independent) we typically use instead the pseudoinverse $\mathbf{M}^{\dagger}$, defined as

$$ \mathbf{M}^{\dagger} = [\mathbf{M}^t\mathbf{M}]^{-1}\mathbf{M}^t. \tag{28} $$

The pseudoinverse is useful because it ensures $\mathbf{M}^{\dagger}\mathbf{M} = \mathbf{I}$.
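
To make the cofactor and adjoint construction of the inverse concrete, here is a minimal sketch using exact rational arithmetic from the standard library. It assumes a square matrix of size at least 2 x 2 with non-zero determinant; the function names and the example matrix are illustrative assumptions.

```python
from fractions import Fraction

def minor(M, i, j):
    """M with row i and column j deleted (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(M) if k != i]

def det(M):
    """Determinant by expansion by minors on the first column."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** i * M[i][0] * det(minor(M, i, 0)) for i in range(len(M)))

def inverse(M):
    """Inverse as Adj[M] / |M|; assumes d >= 2 and |M| != 0."""
    d = len(M)
    det_M = det(M)
    if det_M == 0:
        raise ValueError("matrix is singular; the inverse does not exist")
    # Entry (i, j) of the adjoint is the (j, i) cofactor of M.
    return [[Fraction((-1) ** (i + j) * det(minor(M, j, i)), det_M)
             for j in range(d)] for i in range(d)]

M = [[2, 1],
     [5, 3]]
M_inv = inverse(M)
# Multiply M by its inverse; the result should be the identity matrix.
product = [[sum(a * b for a, b in zip(row, col)) for col in zip(*M_inv)] for row in M]
print(M_inv)
print(product)
```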


A.2.7 Eigenvectors and eigenvalues


Given a $d \times d$ matrix $\mathbf{M}$, a very important class of linear equations is of the form

$$ \mathbf{M}\mathbf{x} = \lambda\mathbf{x} \tag{29} $$

for scalar $\lambda$, which can be rewritten

$$ (\mathbf{M} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}, \tag{30} $$

where $\mathbf{I}$ is the identity matrix and $\mathbf{0}$ the zero vector. The solution vector $\mathbf{x} = \mathbf{e}_i$ and
corresponding scalar $\lambda = \lambda_i$ to Eq. 29 are called the eigenvector and associated
eigenvalue. There are d (possibly non-distinct) solution vectors $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_d\}$, each with
an associated eigenvalue $\{\lambda_1, \lambda_2, \ldots, \lambda_d\}$. Under multiplication by $\mathbf{M}$ the eigenvectors
are changed only in magnitude, not direction:

$$ \mathbf{M}\mathbf{e}_j = \lambda_j\mathbf{e}_j. \tag{31} $$

If $\mathbf{M}$ is diagonal, then the eigenvectors are parallel to the coordinate axes.
One method of finding the eigenvectors and eigenvalues is to solve the characteristic
equation (or secular equation),

$$ |\mathbf{M} - \lambda\mathbf{I}| = \lambda^d + a_1\lambda^{d-1} + \ldots + a_{d-1}\lambda + a_d = 0, \tag{32} $$

for each of its d (possibly non-distinct) roots $\lambda_j$. For each such root, we then solve a
set of linear equations to find its associated eigenvector $\mathbf{e}_j$.


Finally, it can be shown that the determinant of a matrix is just the product of
its eigenvalues:

$$ |\mathbf{M}| = \prod_{i=1}^{d} \lambda_i. \tag{33} $$
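
For d = 2 the characteristic equation is a quadratic, $\lambda^2 - \mathrm{tr}[\mathbf{M}]\lambda + |\mathbf{M}| = 0$, so the procedure can be carried out in closed form. The sketch below assumes a 2 x 2 matrix with real eigenvalues and a non-zero upper off-diagonal entry; the function name and the example matrix are choices made for this illustration.

```python
import math

def eig_2x2(M):
    """Eigenvalues and eigenvectors of a 2 x 2 matrix with real eigenvalues.

    Assumes the off-diagonal entry m_12 is non-zero.
    """
    (a, b), (c, d) = M
    trace, det_M = a + d, a * d - b * c
    # Roots of the characteristic equation lambda^2 - tr[M] lambda + |M| = 0.
    root = math.sqrt(trace ** 2 - 4 * det_M)
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    # For each root, (M - lambda I) e = 0 is solved by e = (m_12, lambda - m_11).
    eigenvectors = [(b, lam - a) for lam in eigenvalues]
    return eigenvalues, eigenvectors

M = [[2.0, 1.0],
     [1.0, 2.0]]
eigenvalues, eigenvectors = eig_2x2(M)

for lam, e in zip(eigenvalues, eigenvectors):
    Me = [M[0][0] * e[0] + M[0][1] * e[1], M[1][0] * e[0] + M[1][1] * e[1]]
    print(lam, e, Me)   # Me is lam times e: only the magnitude changes

det_M = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(eigenvalues[0] * eigenvalues[1], det_M)   # product of eigenvalues equals |M|
```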


A.3 Lagrange optimization


Suppose we seek the position $\mathbf{x}_0$ of an extremum of a scalar-valued function $f(\mathbf{x})$,
subject to some constraint. If a constraint can be expressed in the form $g(\mathbf{x}) = 0$,
then we can find the extremum of $f(\mathbf{x})$ as follows. First we form the Lagrangian
function

$$ L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda \underbrace{g(\mathbf{x})}_{=0}, \tag{34} $$

where $\lambda$ is a scalar called the Lagrange undetermined multiplier. We convert this
constrained optimization problem into an unconstrained problem by taking the derivative,

$$ \frac{\partial L(\mathbf{x}, \lambda)}{\partial \mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} + \lambda \frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} = 0, \tag{35} $$

and using standard methods from calculus to solve the resulting equations for $\lambda$ and
the extremizing value of $\mathbf{x}$. (Note that the last term on the left hand side does not
vanish, in general.) The solution gives the $\mathbf{x}$ position of the extremum, and it is a
simple matter of substitution to find the extreme value of $f(\cdot)$ under the constraints.
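
As a worked illustration of the Lagrange procedure, consider a small two-dimensional problem. The particular choice of $f$ and $g$ is an assumption made for this example: find the point on the line $x_1 + x_2 = 1$ closest to the origin.

```latex
\begin{align*}
f(\mathbf{x}) &= x_1^2 + x_2^2, \qquad g(\mathbf{x}) = x_1 + x_2 - 1 = 0, \\
L(\mathbf{x}, \lambda) &= x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1), \\
\frac{\partial L}{\partial x_1} &= 2x_1 + \lambda = 0, \qquad
\frac{\partial L}{\partial x_2} = 2x_2 + \lambda = 0
\;\Rightarrow\; x_1 = x_2 = -\lambda/2, \\
g(\mathbf{x}) = 0 &\;\Rightarrow\; -\lambda - 1 = 0 \;\Rightarrow\; \lambda = -1, \quad x_1 = x_2 = \tfrac{1}{2}, \\
f(\mathbf{x}_0) &= \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}.
\end{align*}
```

The constraint is used only at the last step, to fix the value of the multiplier; substituting the solution back into $f$ gives the extreme value under the constraint.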


A.4 Probability Theory 


A.4.1 Discrete random variables 


Let $x$ be a discrete random variable that can assume any of the finite number $m$ of
different values in the set $\mathcal{X} = \{v_1, v_2, \ldots, v_m\}$. We denote by $p_i$ the probability that
$x$ assumes the value $v_i$:

$$ p_i = \Pr\{x = v_i\}, \quad i = 1, \ldots, m. \tag{36} $$

Then the probabilities $p_i$ must satisfy the following two conditions:

$$ p_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1. \tag{37} $$

Sometimes it is more convenient to express the set of probabilities $\{p_1, p_2, \ldots, p_m\}$
in terms of the probability mass function $P(x)$, which must satisfy the following
conditions:

$$ P(x) \ge 0 \quad \text{and} \quad \sum_{x \in \mathcal{X}} P(x) = 1 \quad \text{and} \tag{38} $$

$$ \sum_{x \notin \mathcal{X}} P(x) = 0. \tag{39} $$


A.4.2 Expected values


The expected value, mean or average of the random variable $x$ is defined by

$$ \mathcal{E}[x] = \mu = \sum_{x \in \mathcal{X}} x P(x) = \sum_{i=1}^{m} v_i p_i. \tag{40} $$

If one thinks of the probability mass function as defining a set of point masses, with
$p_i$ being the mass concentrated at $x = v_i$, then the expected value $\mu$ is just the center
of mass. Alternatively, we can interpret $\mu$ as the arithmetic average of the values in a
large random sample. More generally, if $f(x)$ is any function of $x$, the expected value
of $f$ is defined by

$$ \mathcal{E}[f(x)] = \sum_{x \in \mathcal{X}} f(x) P(x). \tag{41} $$

Note that the process of forming an expected value is linear, in that if $\alpha_1$ and $\alpha_2$ are
arbitrary constants,

$$ \mathcal{E}[\alpha_1 f_1(x) + \alpha_2 f_2(x)] = \alpha_1 \mathcal{E}[f_1(x)] + \alpha_2 \mathcal{E}[f_2(x)]. \tag{42} $$

It is sometimes convenient to think of $\mathcal{E}$ as an operator, the (linear) expectation
operator. Two important special-case expectations are the second moment and the
variance:

$$ \mathcal{E}[x^2] = \sum_{x \in \mathcal{X}} x^2 P(x) \tag{43} $$

$$ \mathrm{Var}[x] \equiv \sigma^2 = \mathcal{E}[(x - \mu)^2] = \sum_{x \in \mathcal{X}} (x - \mu)^2 P(x), \tag{44} $$

where $\sigma$ is the standard deviation of $x$. The variance can be viewed as the moment of
inertia of the probability mass function. The variance is never negative, and is zero
if and only if all of the probability mass is concentrated at one point.
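
The definitions of the mean, second moment and variance can be evaluated directly for any finite probability mass function. The sketch below represents a pmf as a Python dictionary from values to probabilities; this representation, the helper name `expectation` and the numbers are assumptions for illustration.

```python
def expectation(f, pmf):
    """E[f(x)]: the sum over x of f(x) P(x), with pmf mapping value -> probability."""
    return sum(f(x) * p for x, p in pmf.items())

# Hypothetical probability mass function over the values {0, 1, 2, 3}.
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# A valid pmf is non-negative and sums to one.
is_valid = all(p >= 0 for p in pmf.values()) and abs(sum(pmf.values()) - 1) < 1e-12

mean = expectation(lambda x: x, pmf)
second_moment = expectation(lambda x: x ** 2, pmf)
variance = expectation(lambda x: (x - mean) ** 2, pmf)

# Linearity of the expectation operator with constants 2 and 3.
lhs = expectation(lambda x: 2 * x + 3 * x ** 2, pmf)
rhs = 2 * mean + 3 * second_moment

print(is_valid, mean, second_moment, variance)
print(abs(lhs - rhs) < 1e-12)
```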

The standard deviation is a simple but valuable measure of how far values of $x$
are likely to depart from the mean. Its very name suggests that it is the standard
or typical amount one should expect a randomly drawn value for $x$ to deviate or
differ from $\mu$. Chebyshev's inequality (or Bienaymé-Chebyshev inequality) provides a
mathematical relation between the standard deviation and $|x - \mu|$:

$$ \Pr\{|x - \mu| > n\sigma\} \le \frac{1}{n^2}. \tag{45} $$

This inequality is not a tight bound (and it is useless for $n \le 1$); a more practical rule
of thumb, which strictly speaking is true only for the normal distribution, is that 68%
of the values will lie within one, 95% within two, and 99.7% within three standard
deviations of the mean (Fig. A.1). Nevertheless, Chebyshev's inequality shows the
strong link between the standard deviation and the spread of a distribution. In
addition, it suggests that $|x - \mu|/\sigma$ is a meaningful normalized measure of the distance
from $x$ to the mean (cf. Sect. A.4.12).

By expanding the quadratic in Eq. 44, it is easy to prove the useful formula

$$ \mathrm{Var}[x] = \mathcal{E}[x^2] - (\mathcal{E}[x])^2. \tag{46} $$

Note that, unlike the mean, the variance is not linear. In particular, if $y = \alpha x$, where
$\alpha$ is a constant, then $\mathrm{Var}[y] = \alpha^2 \mathrm{Var}[x]$. Moreover, the variance of the sum of two
random variables is usually not the sum of their variances. However, as we shall see
below, variances do add when the variables involved are statistically independent.

In the simple but important special case in which $x$ is binary valued (say, $v_1 = 0$
and $v_2 = 1$), we can obtain simple formulas for $\mu$ and $\sigma$. If we let $p = \Pr\{x = 1\}$,
then it is easy to show that

$$ \mu = p \quad \text{and} \quad \sigma = \sqrt{p(1 - p)}. \tag{47} $$


A.4.3 Pairs of discrete random variables 


Let $x$ and $y$ be random variables which can take on values in $\mathcal{X} = \{v_1, v_2, \ldots, v_m\}$
and $\mathcal{Y} = \{w_1, w_2, \ldots, w_n\}$, respectively. We can think of $(x, y)$ as a vector or a point
in the product space of $x$ and $y$. For each possible pair of values $(v_i, w_j)$ we have a
joint probability $p_{ij} = \Pr\{x = v_i, y = w_j\}$. These $mn$ joint probabilities $p_{ij}$ are
non-negative and sum to 1. Alternatively, we can define a joint probability mass function
$P(x, y)$ for which

$$ P(x, y) \ge 0 \quad \text{and} \quad \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) = 1. \tag{48} $$

The joint probability mass function is a complete characterization of the pair of
random variables $(x, y)$; that is, everything we can compute about $x$ and $y$, individually
or together, can be computed from $P(x, y)$. In particular, we can obtain the separate
marginal distributions for $x$ and $y$ by summing over the unwanted variable:

$$ P_x(x) = \sum_{y \in \mathcal{Y}} P(x, y), \qquad P_y(y) = \sum_{x \in \mathcal{X}} P(x, y). \tag{49} $$

We will occasionally use subscripts, as in Eq. 49, to emphasize the fact that
$P_x(x)$ has a different functional form than $P_y(y)$. It is common to omit them and
write simply $P(x)$ and $P(y)$ whenever the context makes it clear that these are in
fact two different functions, rather than the same function merely evaluated with
different variables.
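
Summing over the unwanted variable is easy to express when the joint probability mass function is stored as a dictionary keyed by value pairs. That representation, the function name `marginals` and the numbers below are assumptions for this sketch.

```python
from collections import defaultdict

def marginals(joint):
    """Return (P_x, P_y) from a joint pmf given as {(x, y): probability}."""
    P_x, P_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        P_x[x] += p   # sum over the unwanted variable y
        P_y[y] += p   # sum over the unwanted variable x
    return dict(P_x), dict(P_y)

# Hypothetical joint probability mass function on {0, 1} x {0, 1, 2}.
joint = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
         (1, 0): 0.30, (1, 1): 0.10, (1, 2): 0.20}

P_x, P_y = marginals(joint)
print(P_x)
print(P_y)
```

Each marginal is itself a probability mass function, so its values sum to one.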


A.4.4 Statistical independence


Variables $x$ and $y$ are said to be statistically independent if and only if

$$ P(x, y) = P_x(x) P_y(y). \tag{50} $$

We can understand such independence as follows. Suppose that $p_i = \Pr\{x = v_i\}$ is
the fraction of the time that $x = v_i$, and $q_j = \Pr\{y = w_j\}$ is the fraction of the time
that $y = w_j$. Consider those situations where $x = v_i$. If it is still true that the fraction
of those situations in which $y = w_j$ is the same value $q_j$, it follows that knowing the
value of $x$ did not give us any additional knowledge about the possible values of $y$;
in that sense $y$ is independent of $x$. Finally, if $x$ and $y$ are statistically independent,
it is clear that the fraction of the time that the specific pair of values $(v_i, w_j)$ occurs
must be the product of the fractions $p_i q_j = P(v_i) P(w_j)$.
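
The factorization test for statistical independence can be checked numerically for a finite joint distribution. In this sketch the joint pmf is a dictionary keyed by value pairs and missing pairs are treated as having probability zero; both the representation and the tolerance `tol` are assumptions.

```python
from itertools import product

def is_independent(joint, tol=1e-12):
    """Check that P(x, y) equals P_x(x) * P_y(y) for every pair of values."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    P_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    P_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - P_x[x] * P_y[y]) <= tol
               for x, y in product(xs, ys))

# Built as a product of marginals (0.25, 0.75) and (0.4, 0.6): independent.
independent = {(0, 0): 0.10, (0, 1): 0.15, (1, 0): 0.30, (1, 1): 0.45}
# Here y always equals x, so knowing x determines y: strongly dependent.
dependent = {(0, 0): 0.5, (1, 1): 0.5}

print(is_independent(independent))
print(is_independent(dependent))
```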


A.4.5 Expected values of functions of two variables 


In the natural extension of Sect. A.4.2, we define the expected value of a function 
f(x,y) of two random variables x and y by 


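$$\mathcal{E}[f(x, y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f(x, y) P(x, y), \tag{51}$$

and as before the expectation operator $\mathcal{E}$ is linear:

$$\mathcal{E}[\alpha_1 f_1(x, y) + \alpha_2 f_2(x, y)] = \alpha_1 \mathcal{E}[f_1(x, y)] + \alpha_2 \mathcal{E}[f_2(x, y)]. \tag{52}$$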


The means (first moments) and variances (second moments) are: 


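$$\begin{aligned}
\mu_x &= \mathcal{E}[x] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x P(x, y) \\
\mu_y &= \mathcal{E}[y] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} y P(x, y) \\
\sigma_x^2 &= \mathcal{V}[x] = \mathcal{E}[(x - \mu_x)^2] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x - \mu_x)^2 P(x, y) \\
\sigma_y^2 &= \mathcal{V}[y] = \mathcal{E}[(y - \mu_y)^2] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (y - \mu_y)^2 P(x, y).
\end{aligned} \tag{53}$$

An important new “cross-moment” can now be defined, the covariance of x and y:

$$\sigma_{xy} = \mathcal{E}[(x - \mu_x)(y - \mu_y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x - \mu_x)(y - \mu_y) P(x, y). \tag{54}$$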


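We can summarize Eqs. 53 & 54 using vector notation as:

$$\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] = \sum_{\mathbf{x} \in \{\mathcal{XY}\}} \mathbf{x} P(\mathbf{x}) \tag{55}$$

$$\boldsymbol{\Sigma} = \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t], \tag{56}$$

where $\{\mathcal{XY}\}$ represents the space of all possible values for all components of $\mathbf{x}$ and
$\boldsymbol{\Sigma}$ is the covariance matrix (cf. Sect. A.4.9).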

The covariance is one measure of the degree of statistical dependence between x
and y. If x and y are statistically independent, then $\sigma_{xy} = 0$. If $\sigma_{xy} = 0$, the variables
x and y are said to be uncorrelated. It does not follow that uncorrelated variables must
be statistically independent; covariance is just one measure of dependence. However,
it is a fact that uncorrelated variables are statistically independent if they have a
multivariate normal distribution, and in practice statisticians often treat uncorrelated
variables as if they were statistically independent. If $\alpha$ is a constant and $y = \alpha x$, which
is a case of strong statistical dependence, it is also easy to show that $\sigma_{xy} = \alpha \sigma_x^2$. Thus,
the covariance is positive if x and y both increase or decrease together, and is negative
if y decreases when x increases.
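
To see concretely that uncorrelated does not imply independent, consider a small sketch in which x is uniform on {-1, 0, 1} and y = x². This example is our own illustration, not one taken from the text; the joint table and the helper name `covariance` are assumptions.

```python
def covariance(joint):
    """Return (mu_x, mu_y, sigma_xy) for a joint PMF {(x, y): probability} (Eq. 54)."""
    mu_x = sum(x * p for (x, _), p in joint.items())
    mu_y = sum(y * p for (_, y), p in joint.items())
    sigma_xy = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
    return mu_x, mu_y, sigma_xy

# Hypothetical example: x uniform on {-1, 0, 1} and y = x**2.
joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}
mu_x, mu_y, sigma_xy = covariance(joint)

# Dependence shows up in the conditional probability, not in the covariance.
p_y1 = sum(p for (_, y), p in joint.items() if y == 1)          # P(y = 1)
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)          # P(x = 0)
p_y1_given_x0 = joint.get((0, 1), 0.0) / p_x0                    # P(y = 1 | x = 0)

print(sigma_xy, p_y1, p_y1_given_x0)
```

The covariance vanishes by symmetry, yet knowing x = 0 changes the probability of y = 1 from 2/3 to 0, so y is completely determined by x.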

There is an important Cauchy-Schwarz inequality for the variances $\sigma_x^2$ and $\sigma_y^2$ and
the covariance $\sigma_{xy}$. It can be derived by observing that the variance of a random
variable is never negative, and thus the variance of $\lambda x + y$ must be non-negative no
matter what the value of the scalar $\lambda$. This leads to the famous inequality

$$\sigma_{xy}^2 \le \sigma_x^2 \sigma_y^2, \tag{57}$$

which is analogous to the vector inequality $(\mathbf{x}^t \mathbf{y})^2 \le \|\mathbf{x}\|^2 \|\mathbf{y}\|^2$ given in Eq. 8.
The correlation coefficient, defined as

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \tag{58}$$

is a normalized covariance, and must always be between −1 and +1. If $\rho = +1$,
then x and y are maximally positively correlated, while if $\rho = -1$, they are maximally
negatively correlated. If $\rho = 0$, the variables are uncorrelated. It is common for
statisticians to consider variables to be uncorrelated for practical purposes if the
magnitude of their correlation coefficient is below some threshold, such as 0.05, although
the threshold that makes sense does depend on the actual situation.
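
The step from the non-negativity of a variance to Eq. 57 can be written out as follows. This is a standard argument, filled in here as a sketch in the notation of Eqs. 53 and 54:

$$\begin{aligned}
0 \le \mathcal{V}[\lambda x + y]
  &= \mathcal{E}\big[(\lambda (x - \mu_x) + (y - \mu_y))^2\big] \\
  &= \lambda^2 \sigma_x^2 + 2 \lambda \sigma_{xy} + \sigma_y^2 \qquad \text{for every real } \lambda .
\end{aligned}$$

A quadratic in $\lambda$ that is never negative cannot have two distinct real roots, so its discriminant satisfies $4\sigma_{xy}^2 - 4\sigma_x^2 \sigma_y^2 \le 0$, which is Eq. 57. Dividing by $\sigma_x^2 \sigma_y^2$ (assumed non-zero) gives $\rho^2 \le 1$, the bound on the correlation coefficient of Eq. 58.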
If x and y are statistically independent, then for any two functions f and g

$$\mathcal{E}[f(x) g(y)] = \mathcal{E}[f(x)]\, \mathcal{E}[g(y)], \tag{59}$$

a result which follows from the definition of statistical independence and expectation.
Note that if $f(x) = x - \mu_x$ and $g(y) = y - \mu_y$, this theorem again shows that
$\sigma_{xy} = \mathcal{E}[(x - \mu_x)(y - \mu_y)]$ is zero if x and y are statistically independent.


A.4.6 Conditional probability 


When two variables are statistically dependent, knowing the value of one of them 
lets us get a better estimate of the value of the other one. This is expressed by the 
following definition of the conditional probability of x given y: 


$$\Pr\{x = v_i \mid y = w_j\} = \frac{\Pr\{x = v_i, y = w_j\}}{\Pr\{y = w_j\}} \tag{60}$$

or, in terms of mass functions,

$$P(x|y) = \frac{P(x, y)}{P(y)}. \tag{61}$$

Note that if x and y are statistically independent, this gives P(x|y) = P(x). That
is, when x and y are independent, knowing the value of y gives you no information
about x that you didn’t already know from its marginal distribution P(x).

Consider a simple illustration of a two-variable binary case where both x and y
are either 0 or 1. Suppose that a large number n of pairs of xy-values are randomly
produced. Let $n_{ij}$ be the number of pairs in which we find x = i and y = j, i.e., we
see the (0,0) pair $n_{00}$ times, the (0,1) pair $n_{01}$ times, and so on, where
$n_{00} + n_{01} + n_{10} + n_{11} = n$. Suppose we pull out those pairs where y = 1, i.e., the (0,1) pairs and
the (1,1) pairs. Clearly, the fraction of those cases in which x is also 1 is

$$\frac{n_{11}}{n_{01} + n_{11}} = \frac{n_{11}/n}{(n_{01} + n_{11})/n}. \tag{62}$$

Intuitively, this is what we would like to get for P(x|y) when y = 1 and n is large.
And, indeed, this is what we do get, because $n_{11}/n$ is approximately P(x, y) and
$(n_{01} + n_{11})/n$ is approximately P(y) for large n.
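
The counting argument can be tried out by simulation. The sketch below assumes a particular joint mass function (the four probabilities are hypothetical) and compares the count ratio of Eq. 62 with the exact value P(x = 1 | y = 1).

```python
import random

random.seed(0)

# Hypothetical joint probabilities P(x, y) for binary x and y.
joint = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.30}

# Draw n random (x, y) pairs and tally the counts n_ij.
n = 100_000
pairs = random.choices(list(joint), weights=list(joint.values()), k=n)
counts = {pair: 0 for pair in joint}
for pair in pairs:
    counts[pair] += 1

# Eq. 62: fraction of the y = 1 pairs in which x is also 1.
estimate = counts[(1, 1)] / (counts[(0, 1)] + counts[(1, 1)])

# Exact conditional probability P(x = 1 | y = 1) = P(1, 1) / P(y = 1).
exact = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])

print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

With these numbers the exact value is 0.30/0.40 = 0.75, and the estimate approaches it as n grows.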


A.4.7 The Law of Total Probability and Bayes’ rule


The Law of Total Probability states that if an event A can occur in m different ways
$A_1, A_2, \ldots, A_m$, and if these m subevents are mutually exclusive (that is, cannot
occur at the same time), then the probability of A occurring is the sum of the
probabilities of the subevents $A_i$. In particular, the random variable y can assume
the value y in m different ways: with $x = v_1$, with $x = v_2$, ..., and with $x = v_m$. Because
these possibilities are mutually exclusive, it follows from the Law of Total Probability
that P(y) is the sum of the joint probability P(x, y) over all possible values for x.
Formally we have

$$P(y) = \sum_{x \in \mathcal{X}} P(x, y). \tag{63}$$


But from the definition of the conditional probability P(y|x) we have

$$P(x, y) = P(y|x) P(x), \tag{64}$$

and after rewriting Eq. 64 with x and y exchanged and a little algebra, we obtain

$$P(x|y) = \frac{P(y|x) P(x)}{\sum_{x \in \mathcal{X}} P(y|x) P(x)}, \tag{65}$$

or in words,

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}},$$

where these terms are discussed more fully in Chapt. ??.

Equation 65 is called Bayes’ rule. Note that the denominator, which is just P(y), is
obtained by summing the numerator over all x values. By writing the denominator in
this form we emphasize the fact that everything on the right-hand side of the equation
is conditioned on x. If we think of x as the important variable, then we can say that
the shape of the distribution P(x|y) depends only on the numerator P(y|x)P(x); the
denominator is just a normalizing factor, sometimes called the evidence, needed to
ensure that the P(x|y) sum to one.

The standard interpretation of Bayes’ rule is that it “inverts” statistical connections,
turning P(y|x) into P(x|y). Suppose that we think of x as a “cause” and y as
an “effect” of that cause. That is, we assume that if the cause x is present, it is easy 
to determine the probability of the effect y being observed; the conditional probability 
function P(y|x) — the likelihood — specifies this probability explicitly. If we observe 
the effect y, it might not be so easy to determine the cause x, because there might 
be several different causes, each of which could produce the same observed effect. 
However, Bayes’ rule makes it easy to determine P(x|y), provided that we know both 
P(y|x) and the so-called prior probability P(x), the probability of x before we make 
any observations about y. Said slightly differently, Bayes’ rule shows how the
probability distribution for x changes from the prior distribution P(x) before anything is
observed about y to the posterior P(x|y) once we have observed the value of y.
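
As a small numerical illustration of Eq. 65, the sketch below takes x to be a fish species and y an observed lightness reading. The prior and likelihood values are hypothetical numbers chosen for the example, and the function name `posterior` is our own.

```python
def posterior(prior, likelihood, y):
    """Bayes' rule (Eq. 65): return P(x | y) for every x.

    prior:      {x: P(x)}
    likelihood: {x: {y: P(y | x)}}
    """
    numerators = {x: likelihood[x][y] * prior[x] for x in prior}
    evidence = sum(numerators.values())  # P(y), by the Law of Total Probability
    return {x: value / evidence for x, value in numerators.items()}

# Hypothetical numbers: x is the species (the "cause"), y the observed lightness.
prior = {"salmon": 0.7, "sea bass": 0.3}
likelihood = {
    "salmon": {"light": 0.2, "dark": 0.8},
    "sea bass": {"light": 0.9, "dark": 0.1},
}

post = posterior(prior, likelihood, "light")
for species, p in post.items():
    print(f"P({species} | light) = {p:.3f}")
```

Observing a light fish raises the probability of sea bass from the prior 0.3 to 0.27/0.41, roughly 0.66; the evidence 0.41 only rescales the two numerators so that the posterior sums to one.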


A.4.8 Vector random variables 


To extend these results from two variables x and y to d variables $x_1, x_2, \ldots, x_d$, it is
convenient to employ vector notation. As given by Eq. 48, the joint probability mass
function $P(\mathbf{x})$ satisfies $P(\mathbf{x}) \ge 0$ and $\sum P(\mathbf{x}) = 1$, where the sum extends over all
possible values for the vector $\mathbf{x}$. Note that $P(\mathbf{x})$ is a function of d variables, and can
be a very complicated, multi-dimensional function. However, if the random variables
$x_i$ are statistically independent, it reduces to the product

$$P(\mathbf{x}) = P_{x_1}(x_1) P_{x_2}(x_2) \cdots P_{x_d}(x_d) = \prod_{i=1}^{d} P_{x_i}(x_i), \tag{66}$$


where we have used the subscripts just to emphasize the fact that the marginal
distributions will generally have a different form. Here the separate marginal distributions
$P_{x_i}(x_i)$ can be obtained by summing the joint distribution over the other variables.
In addition to these univariate marginals, other marginal distributions can be
obtained by this use of the Law of Total Probability. For example, suppose that we have
$P(x_1, x_2, x_3, x_4, x_5)$ and we want $P(x_1, x_4)$; we merely calculate

$$P(x_1, x_4) = \sum_{x_2} \sum_{x_3} \sum_{x_5} P(x_1, x_2, x_3, x_4, x_5). \tag{67}$$

One can define many different conditional distributions, such as $P(x_1, x_2 | x_3)$ or
$P(x_2 | x_1, x_4, x_5)$. For example,

$$P(x_1, x_2 | x_3) = \frac{P(x_1, x_2, x_3)}{P(x_3)}, \tag{68}$$

where all of the joint distributions can be obtained from $P(\mathbf{x})$ by summing out the
unwanted variables. If instead of scalars we have vector variables, then these conditional
distributions can also be written as

$$P(\mathbf{x}_1 | \mathbf{x}_2) = \frac{P(\mathbf{x}_1, \mathbf{x}_2)}{P(\mathbf{x}_2)}, \tag{69}$$

and likewise, in vector form, Bayes’ rule becomes

$$P(\mathbf{x}_1 | \mathbf{x}_2) = \frac{P(\mathbf{x}_2 | \mathbf{x}_1) P(\mathbf{x}_1)}{\sum_{\mathbf{x}_1} P(\mathbf{x}_2 | \mathbf{x}_1) P(\mathbf{x}_1)}. \tag{70}$$


A.4.9 Expectations, mean vectors and covariance matrices 


The expected value of a vector is defined to be the vector whose components are 
the expected values of the original components. Thus, if f(x) is an n-dimensional, 
vector-valued function of the d-dimensional random vector x, 


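$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_n(\mathbf{x}) \end{bmatrix}, \tag{71}$$

then the expected value of $\mathbf{f}$ is defined by

$$\mathcal{E}[\mathbf{f}] = \begin{bmatrix} \mathcal{E}[f_1(\mathbf{x})] \\ \mathcal{E}[f_2(\mathbf{x})] \\ \vdots \\ \mathcal{E}[f_n(\mathbf{x})] \end{bmatrix} = \sum_{\mathbf{x}} \mathbf{f}(\mathbf{x}) P(\mathbf{x}). \tag{72}$$

In particular, the d-dimensional mean vector $\boldsymbol{\mu}$ is defined by

$$\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] = \begin{bmatrix} \mathcal{E}[x_1] \\ \mathcal{E}[x_2] \\ \vdots \\ \mathcal{E}[x_d] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix} = \sum_{\mathbf{x}} \mathbf{x} P(\mathbf{x}). \tag{73}$$

Similarly, the covariance matrix $\boldsymbol{\Sigma}$ is defined as the (square) matrix whose ijth element
$\sigma_{ij}$ is the covariance of $x_i$ and $x_j$:

$$\sigma_{ij} = \sigma_{ji} = \mathcal{E}[(x_i - \mu_i)(x_j - \mu_j)], \qquad i, j = 1, \ldots, d, \tag{74}$$

as we saw in the two-variable case of Eq. 54. Therefore, in expanded form we have

$$\begin{aligned}
\boldsymbol{\Sigma}
&= \begin{bmatrix}
\mathcal{E}[(x_1 - \mu_1)(x_1 - \mu_1)] & \mathcal{E}[(x_1 - \mu_1)(x_2 - \mu_2)] & \cdots & \mathcal{E}[(x_1 - \mu_1)(x_d - \mu_d)] \\
\mathcal{E}[(x_2 - \mu_2)(x_1 - \mu_1)] & \mathcal{E}[(x_2 - \mu_2)(x_2 - \mu_2)] & \cdots & \mathcal{E}[(x_2 - \mu_2)(x_d - \mu_d)] \\
\vdots & \vdots & \ddots & \vdots \\
\mathcal{E}[(x_d - \mu_d)(x_1 - \mu_1)] & \mathcal{E}[(x_d - \mu_d)(x_2 - \mu_2)] & \cdots & \mathcal{E}[(x_d - \mu_d)(x_d - \mu_d)]
\end{bmatrix} \\
&= \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd}
\end{bmatrix}
= \begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}.
\end{aligned} \tag{75}$$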


We can use the vector product $(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t$ to write the covariance matrix as

$$\boldsymbol{\Sigma} = \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t]. \tag{76}$$

Thus, $\boldsymbol{\Sigma}$ is symmetric, and its diagonal elements are just the variances of the
individual elements of $\mathbf{x}$, which can never be negative; the off-diagonal elements are
the covariances, which can be positive or negative. If the variables are statistically
independent, the covariances are zero, and the covariance matrix is diagonal. The
analog to the Cauchy-Schwarz inequality comes from recognizing that if $\mathbf{w}$ is any
d-dimensional vector, then the variance of $\mathbf{w}^t \mathbf{x}$ can never be negative. This leads to the
requirement that the quadratic form $\mathbf{w}^t \boldsymbol{\Sigma} \mathbf{w}$ never be negative. Matrices for which
this is true are said to be positive semi-definite; thus, the covariance matrix $\boldsymbol{\Sigma}$ must
be positive semi-definite. It can be shown that this is equivalent to the requirement
that none of the eigenvalues of $\boldsymbol{\Sigma}$ can be negative.
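
These properties can be observed on a sample estimate of the covariance matrix. The following sketch uses only the Python standard library; the synthetic three-dimensional data and the helper names are assumptions made for illustration, and the expectation in Eq. 76 is replaced by an average over the samples.

```python
import random

def covariance_matrix(samples):
    """Sample estimate of Sigma = E[(x - mu)(x - mu)^t] for a list of d-dimensional rows."""
    n, d = len(samples), len(samples[0])
    mu = [sum(row[i] for row in samples) / n for i in range(d)]
    return [
        [sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in samples) / n
         for j in range(d)]
        for i in range(d)
    ]

def quadratic_form(w, sigma):
    """Return w^t Sigma w, which equals the variance of the scalar w^t x."""
    d = len(w)
    return sum(w[i] * sigma[i][j] * w[j] for i in range(d) for j in range(d))

random.seed(1)
samples = []
for _ in range(500):
    a = random.gauss(0.0, 1.0)
    b = random.gauss(0.0, 1.0)
    samples.append([a, 0.5 * a + b, a - b])  # hypothetical correlated components

sigma = covariance_matrix(samples)

# Evaluate w^t Sigma w for many random directions w.
forms = [
    quadratic_form([random.uniform(-1.0, 1.0) for _ in range(3)], sigma)
    for _ in range(1000)
]
print(min(forms) >= -1e-9)
```

The matrix comes out symmetric with non-negative diagonal entries, and the quadratic form is non-negative for every direction tried, up to rounding error.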


A.4.10 Continuous random variables 


When the random variable x can take values in the continuum, it no longer makes 
sense to talk about the probability that x has a particular value, such as 2.5136, 
because the probability of any particular exact value will almost always be zero. 
Rather, we talk about the probability that x falls in some interval (a,b); instead of 
having a probability mass function P(x) we have a probability mass density function
p(x). The mass density has the property that

$$\Pr\{x \in (a, b)\} = \int_a^b p(x)\, dx. \tag{77}$$

The name density comes by analogy with material density. If we consider a small
interval $(a, a + \Delta x)$ over which p(x) is essentially constant, having value p(a), we see
that $p(a) = \Pr\{x \in (a, a + \Delta x)\}/\Delta x$. That is, the probability mass density at x = a
is the probability mass $\Pr\{x \in (a, a + \Delta x)\}$ per unit distance. It follows that the
probability density function must satisfy

$$p(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p(x)\, dx = 1. \tag{78}$$


In general, most of the definitions and formulas for discrete random variables carry 
over to continuous random variables with sums replaced by integrals. In particular, 
the expected value, mean and variance for a continuous random variable are defined 
by 


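$$\begin{aligned}
\mathcal{E}[f(x)] &= \int_{-\infty}^{\infty} f(x) p(x)\, dx \\
\mu = \mathcal{E}[x] &= \int_{-\infty}^{\infty} x\, p(x)\, dx \\
\sigma^2 = \mathcal{E}[(x - \mu)^2] &= \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx,
\end{aligned} \tag{79}$$

and, as in Eq. 46, the variance obeys $\sigma^2 = \mathcal{E}[x^2] - (\mathcal{E}[x])^2$.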
The multivariate situation is similarly handled with continuous random vectors x. 
The probability density function p(x) must satisfy 


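$$p(\mathbf{x}) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p(\mathbf{x})\, d\mathbf{x} = 1, \tag{80}$$

where the integral is understood to be a d-fold, multiple integral, and where $d\mathbf{x}$ is the
element of d-dimensional volume $d\mathbf{x} = dx_1 dx_2 \cdots dx_d$. The corresponding moments
for a general n-dimensional vector-valued function are

$$\mathcal{E}[\mathbf{f}(\mathbf{x})] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \mathbf{f}(\mathbf{x}) p(\mathbf{x})\, dx_1 dx_2 \ldots dx_d = \int_{-\infty}^{\infty} \mathbf{f}(\mathbf{x}) p(\mathbf{x})\, d\mathbf{x} \tag{81}$$

and for the particular d-dimensional functions as above, we have

$$\begin{aligned}
\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] &= \int_{-\infty}^{\infty} \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} \\
\boldsymbol{\Sigma} = \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] &= \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t p(\mathbf{x})\, d\mathbf{x}.
\end{aligned} \tag{82}$$

If the components of $\mathbf{x}$ are statistically independent, then the joint probability density
function factors as

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i) \tag{83}$$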


and the covariance matrix is diagonal. 
Conditional probability density functions are defined just as conditional mass
functions. Thus, for example, the density for x given y is given by

$$p(x|y) = \frac{p(x, y)}{p(y)} \tag{84}$$

and Bayes’ rule for density functions is

$$p(x|y) = \frac{p(y|x) p(x)}{\int_{-\infty}^{\infty} p(y|x) p(x)\, dx}, \tag{85}$$

and likewise for the vector case.
Occasionally we will need to take the expectation with respect to a subset of the
variables, and in that case we must show this as a subscript, for instance

$$\mathcal{E}_{x_1}[f(x_1, x_2)] = \int_{-\infty}^{\infty} f(x_1, x_2) p(x_1)\, dx_1. \tag{86}$$


A.4.11 Distributions of sums of independent random variables


It frequently happens that we know the densities for two independent random variables 
x and y, and we need to know the density of their sum z = x + y. It is easy to obtain 
the mean and the variance of this sum: 


$$\begin{aligned}
\mu_z &= \mathcal{E}[z] = \mathcal{E}[x + y] = \mathcal{E}[x] + \mathcal{E}[y] = \mu_x + \mu_y, \\
\sigma_z^2 &= \mathcal{E}[(z - \mu_z)^2] = \mathcal{E}[(x + y - (\mu_x + \mu_y))^2] = \mathcal{E}[((x - \mu_x) + (y - \mu_y))^2] \\
&= \mathcal{E}[(x - \mu_x)^2] + 2 \underbrace{\mathcal{E}[(x - \mu_x)(y - \mu_y)]}_{=0} + \mathcal{E}[(y - \mu_y)^2] \\
&= \sigma_x^2 + \sigma_y^2,
\end{aligned} \tag{87}$$

where we have used the fact that the cross-term factors into $\mathcal{E}[x - \mu_x]\, \mathcal{E}[y - \mu_y]$ when
x and y are independent; in this case the product is manifestly zero, since each of
the component expectations vanishes. Thus, in words, the mean of the sum of two
independent random variables is the sum of their means, and the variance of their
sum is the sum of their variances. If the variables are random yet not independent
(for instance y = −x, where x is randomly distributed), then the variance is not the
sum of the component variances.

It is only slightly more difficult to work out the exact probability density function
for z = x + y from the separate density functions for x and y. The probability that z is
between $\zeta$ and $\zeta + \Delta z$ can be found by integrating the joint density $p(x, y) = p_x(x) p_y(y)$
over the thin strip in the xy-plane between the lines $x + y = \zeta$ and $x + y = \zeta + \Delta z$.
It follows that, for small $\Delta z$,

$$\Pr\{\zeta < z < \zeta + \Delta z\} = \left[ \int_{-\infty}^{\infty} p_x(x) p_y(\zeta - x)\, dx \right] \Delta z, \tag{88}$$

and hence that the probability density function for the sum is the convolution of the
probability density functions for the components:

$$p_z(z) = p_x(x) \star p_y(y) = \int_{-\infty}^{\infty} p_x(x) p_y(z - x)\, dx. \tag{89}$$


As one would expect, these results generalize. It is not hard to show that: 


• The mean of the sum of d independent random variables $x_1, x_2, \ldots, x_d$ is the
  sum of their means. (In fact the variables need not be independent for this to
  hold.)

• The variance of the sum is the sum of their variances.

• The probability density function for the sum is the convolution of the separate
  density functions:

$$p_z(z) = p_{x_1}(x_1) \star p_{x_2}(x_2) \star \cdots \star p_{x_d}(x_d). \tag{90}$$
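
The same three facts hold for discrete variables, with the convolution integral replaced by a sum. The sketch below illustrates this discrete analog for the total of two fair dice; the example and the helper names are our own and are not taken from the text.

```python
def convolve(p, q):
    """PMF of z = x + y for independent x ~ p and y ~ q (dicts mapping value -> probability)."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, 0.0) + px * qy
    return out

def mean(p):
    return sum(value * prob for value, prob in p.items())

def variance(p):
    m = mean(p)
    return sum((value - m) ** 2 * prob for value, prob in p.items())

die = {face: 1 / 6 for face in range(1, 7)}
total = convolve(die, die)  # distribution of the sum of two independent dice

print(f"mean: {mean(die):.4f} + {mean(die):.4f} vs {mean(total):.4f}")
print(f"variance: {variance(die):.4f} + {variance(die):.4f} vs {variance(total):.4f}")
```

The convolved distribution is the familiar triangular one peaking at 7, and both its mean and its variance are twice those of a single die.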


A.4.12 Univariate normal density


One of the most important results of probability theory is the Central Limit Theorem,
which states that, under various conditions, the distribution for the sum of d
independent random variables approaches a particular limiting form known as the normal
distribution. As such, the normal or Gaussian probability density function is very
important, both for theoretical and practical reasons. In one dimension, it is defined
by

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}. \tag{91}$$

The normal density is traditionally described as a “bell-shaped curve”; it is
completely determined by the numerical values for two parameters, the mean $\mu$ and the
variance $\sigma^2$. This is often emphasized by writing $p(x) \sim N(\mu, \sigma^2)$, which is read as
“x is distributed normally with mean $\mu$ and variance $\sigma^2$.” The distribution is
symmetrical about the mean, the peak occurring at $x = \mu$, and the width of the “bell”
is proportional to the standard deviation $\sigma$. The parameters of a normal density in
Eq. 91 satisfy the following equations:

$$\begin{aligned}
\mathcal{E}[1] &= \int_{-\infty}^{\infty} p(x)\, dx = 1 \\
\mathcal{E}[x] &= \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu \\
\mathcal{E}[(x - \mu)^2] &= \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx = \sigma^2.
\end{aligned} \tag{92}$$


Normally distributed data points tend to cluster about the mean. Numerically, the 
probabilities obey 


$$\begin{aligned}
\Pr\{|x - \mu| \le \sigma\} &\approx 0.68 \\
\Pr\{|x - \mu| \le 2\sigma\} &\approx 0.95 \\
\Pr\{|x - \mu| \le 3\sigma\} &\approx 0.997,
\end{aligned} \tag{93}$$

as shown in Fig. A.1.
A natural measure of the distance from x to the mean $\mu$ is the distance $|x - \mu|$
measured in units of standard deviations:

$$r = \frac{|x - \mu|}{\sigma}, \tag{94}$$

the Mahalanobis distance from x to $\mu$. (In the one-dimensional case, this is sometimes
called the z-score.) Thus for instance the probability is 0.95 that the Mahalanobis
distance from x to $\mu$ will be less than 2. If a random variable x is modified by
(a) subtracting its mean and (b) dividing by its standard deviation, it is said to be
standardized. Clearly, a standardized normal random variable $u = (x - \mu)/\sigma$ has zero
mean and unit standard deviation, that is,

$$p(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, \tag{95}$$

which can be written as $p(u) \sim N(0, 1)$. Table A.1 shows the probability that a value,
chosen at random according to $p(u) \sim N(0, 1)$, differs from the mean value by less
than a criterion z.

Figure A.1: A one-dimensional Gaussian distribution, $p(u) \sim N(0, 1)$, has 68% of its
probability mass in the range $|u| \le 1$, 95% in the range $|u| \le 2$, and 99.7% in the
range $|u| \le 3$.


Table A.1: The probability a sample drawn from a standardized Gaussian has absolute
value less than a criterion, i.e., Pr[|u| < z]


  z     Pr[|u| < z]  |   z     Pr[|u| < z]  |   z       Pr[|u| < z]
 0.0      0.0        |  1.0      0.682      |  2.0       0.954
 0.1      0.080      |  1.1      0.728      |  2.1       0.963
 0.2      0.158      |  1.2      0.770      |  2.326     0.980
 0.3      0.236      |  1.3      0.806      |  2.5       0.988
 0.4      0.310      |  1.4      0.838      |  2.576     0.990
 0.5      0.382      |  1.5      0.866      |  3.0       0.9974
 0.6      0.452      |  1.6      0.890      |  3.090     0.9980
 0.7      0.516      |  1.7      0.910      |  3.291     0.999
 0.8      0.576      |  1.8      0.928      |  3.5       0.9996
 0.9      0.632      |  1.9      0.942      |  4.0       0.99994
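
Entries of this kind can be recomputed with the error function in the Python standard library. Note that `math.erf` follows the common convention erf(t) = (2/√π) ∫₀ᵗ exp(−s²) ds, so the probability Pr[|u| < z] for a standardized Gaussian is obtained as erf(z/√2). The function name `prob_within` in this sketch is our own.

```python
import math

def prob_within(z):
    """Pr[|u| < z] for u ~ N(0, 1), via the standard-library error function."""
    return math.erf(z / math.sqrt(2.0))

for z in (0.5, 1.0, 2.0, 2.576, 3.0):
    print(f"z = {z:5.3f}   Pr[|u| < z] = {prob_within(z):.4f}")
```

The computed values agree with the table to within rounding in the last tabulated digit.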


A.5 Gaussian derivatives and integrals 


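Because of the prevalence of Gaussian functions throughout statistical pattern
recognition, we often have occasion to integrate and differentiate them. The first three
derivatives of a one-dimensional (standardized) Gaussian are

$$\begin{aligned}
\frac{\partial}{\partial x}\left[\frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}\right]
  &= \frac{-x}{\sqrt{2\pi}\,\sigma^3}\, e^{-x^2/(2\sigma^2)} = \frac{-x}{\sigma^2}\, p(x) \\
\frac{\partial^2}{\partial x^2}\left[\frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}\right]
  &= \frac{1}{\sqrt{2\pi}\,\sigma^5} (-\sigma^2 + x^2)\, e^{-x^2/(2\sigma^2)} = \frac{-\sigma^2 + x^2}{\sigma^4}\, p(x) \\
\frac{\partial^3}{\partial x^3}\left[\frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/(2\sigma^2)}\right]
  &= \frac{1}{\sqrt{2\pi}\,\sigma^7} (3x\sigma^2 - x^3)\, e^{-x^2/(2\sigma^2)} = \frac{3x\sigma^2 - x^3}{\sigma^6}\, p(x),
\end{aligned} \tag{96}$$

and are shown in Fig. A.2.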


Figure A.2: A one-dimensional Gaussian distribution and its first three derivatives, 
shown for f(x) ~ N(0, 1). 


An important finite integral of the Gaussian is the so-called error function, defined
as

$$\operatorname{erf}(u) = \frac{2}{\sqrt{2\pi}} \int_0^u e^{-x^2/2}\, dx. \tag{97}$$

As can be seen from Fig. A.1, erf(0) = 0, erf(1) ≈ 0.68 and $\lim_{x \to \infty} \operatorname{erf}(x) = 1$. There
is no closed analytic form for the error function, and thus we typically use tables,
approximations or numerical integration for its evaluation (Fig. A.3).

In calculating moments of Gaussians, we need the general integral of powers of x
weighted by a Gaussian. Recall first the definition of a gamma function

$$\Gamma(n + 1) = \int_0^{\infty} x^n e^{-x}\, dx, \tag{98}$$

where the gamma function obeys

$$\Gamma(n + 1) = n\, \Gamma(n) \tag{99}$$

and $\Gamma(1/2) = \sqrt{\pi}$. For n a positive integer we have
$\Gamma(n + 1) = n \times (n - 1) \times (n - 2) \cdots 1 = n!$, read “n factorial.”


Figure A.3: The error function corresponds to the area under a standardized Gaussian 
(Eq. 97) between —u and u, i.e., it describes the probability that a sample drawn 
from a standardized Gaussian obeys |x| < u. Thus, the complementary probability, 
1 — erf(u) is the probability that a sample is chosen with |x| > u. Chebyshev's 
inequality states that for an arbitrary distribution having standard deviation = 1, 
this latter probability is bounded by 1/u?. As shown, this bound is quite loose for a 
Gaussian. 


Changing variables in Eq. 98, we find the moments of a (normalized) Gaussian
distribution as

$$2 \int_0^{\infty} x^n\, \frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}\, dx = \frac{2^{n/2} \sigma^n}{\sqrt{\pi}}\, \Gamma\!\left(\frac{n + 1}{2}\right), \tag{100}$$

where again we have used a pre-factor of 2 and lower integration limit of 0 in order
to give non-trivial (i.e., non-vanishing) results for odd n.
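
Equation 100 can be checked numerically. The sketch below compares the closed form, evaluated with `math.gamma`, against a simple midpoint-rule approximation of the integral; the integration range of 12σ, the step count, and the function names are choices made for this illustration only.

```python
import math

def moment_closed_form(n, sigma):
    """Right-hand side of Eq. 100."""
    return 2 ** (n / 2) * sigma ** n / math.sqrt(math.pi) * math.gamma((n + 1) / 2)

def moment_numeric(n, sigma, steps=100_000):
    """Left-hand side of Eq. 100, by the midpoint rule on [0, 12 * sigma]."""
    upper = 12.0 * sigma          # the integrand is negligible beyond this point
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x ** n * math.exp(-x * x / (2.0 * sigma * sigma))
    return 2.0 * total * h / (math.sqrt(2.0 * math.pi) * sigma)

sigma = 1.5
for n in range(5):
    print(n, moment_closed_form(n, sigma), moment_numeric(n, sigma))
```

For n = 0 both sides give 1 (the normalization), and for n = 2 both give σ², the variance, as expected from Eq. 92.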


A.5.1 Multivariate normal densities 


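Normal random variables have many desirable theoretical properties. For example, it
turns out that the convolution of two Gaussian functions is again a Gaussian function,
and thus the distribution for the sum of two independent normal random variables is
again normal. In fact, sums of dependent normal random variables also have normal
distributions, provided the variables are jointly normal. Suppose that each of the d
random variables $x_i$ is normally distributed, each with its own mean and variance:
$p(x_i) \sim N(\mu_i, \sigma_i^2)$. If these variables are independent, their joint density has the form

$$\begin{aligned}
p(\mathbf{x}) &= \prod_{i=1}^{d} p(x_i) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2} \\
&= \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left[-\frac{1}{2} \sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right].
\end{aligned} \tag{101}$$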


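This can be written in a compact matrix form if we observe that for this case the
covariance matrix is diagonal, i.e.,

$$\boldsymbol{\Sigma} = \begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_d^2
\end{bmatrix}, \tag{102}$$

and hence the inverse of the covariance matrix is easily written as

$$\boldsymbol{\Sigma}^{-1} = \begin{bmatrix}
1/\sigma_1^2 & 0 & \cdots & 0 \\
0 & 1/\sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1/\sigma_d^2
\end{bmatrix}. \tag{103}$$

Thus, the exponent in Eq. 101 can be rewritten using

$$\sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}). \tag{104}$$

Finally, by noting that the determinant of $\boldsymbol{\Sigma}$ is just the product of the variances, we
can write the joint density compactly in terms of the quadratic form

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]. \tag{105}$$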


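This is the general form of a multivariate normal density function, where the
covariance matrix $\boldsymbol{\Sigma}$ is no longer required to be diagonal. With a little linear algebra, it
can be shown that if $\mathbf{x}$ obeys this density function, then

$$\begin{aligned}
\boldsymbol{\mu} = \mathcal{E}[\mathbf{x}] &= \int_{-\infty}^{\infty} \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} \\
\boldsymbol{\Sigma} = \mathcal{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] &= \int_{-\infty}^{\infty} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t p(\mathbf{x})\, d\mathbf{x},
\end{aligned} \tag{106}$$

just as one would expect. Multivariate normal data tend to cluster about the mean
vector, $\boldsymbol{\mu}$, falling in an ellipsoidally-shaped cloud whose principal axes are the
eigenvectors of the covariance matrix. The natural measure of the distance from $\mathbf{x}$ to the
mean $\boldsymbol{\mu}$ is provided by the quantity

$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}), \tag{107}$$

which is the square of the Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$. It is not as easy
to standardize a vector random variable (reduce it to zero mean and unit covariance
matrix) as it is in the univariate case. The expression analogous to $u = (x - \mu)/\sigma$ is
$\mathbf{u} = \boldsymbol{\Sigma}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})$, which involves the “square root” of the inverse of the covariance matrix.
The process of obtaining $\boldsymbol{\Sigma}^{-1/2}$ requires finding the eigenvalues and eigenvectors of
$\boldsymbol{\Sigma}$, and is just a bit beyond the scope of this Appendix.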
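

A.5.2 Bivariate normal densities


It is illuminating to look at the bivariate normal density, that is, the case of two
normally distributed random variables $x_1$ and $x_2$. It is convenient to define
$\sigma_1^2 = \sigma_{11}$, $\sigma_2^2 = \sigma_{22}$, and to introduce the correlation coefficient $\rho$ defined by

$$\rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2}. \tag{108}$$

With this notation, the covariance matrix becomes

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
= \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix}, \tag{109}$$

and its determinant simplifies to

$$|\boldsymbol{\Sigma}| = \sigma_1^2 \sigma_2^2 (1 - \rho^2). \tag{110}$$

Thus, the inverse covariance matrix is given by

$$\begin{aligned}
\boldsymbol{\Sigma}^{-1} &= \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)}
\begin{bmatrix} \sigma_2^2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma_1^2 \end{bmatrix} \\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1 \sigma_2) \\ -\rho/(\sigma_1 \sigma_2) & 1/\sigma_2^2 \end{bmatrix}.
\end{aligned} \tag{111}$$

Next we explicitly expand the quadratic form in the normal density:

$$\begin{aligned}
(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
&= \begin{bmatrix} (x_1 - \mu_1) & (x_2 - \mu_2) \end{bmatrix}
\frac{1}{1 - \rho^2}
\begin{bmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1 \sigma_2) \\ -\rho/(\sigma_1 \sigma_2) & 1/\sigma_2^2 \end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\
&= \frac{1}{1 - \rho^2} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2
 - 2\rho \left(\frac{x_1 - \mu_1}{\sigma_1}\right) \left(\frac{x_2 - \mu_2}{\sigma_2}\right)
 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 \right].
\end{aligned} \tag{112}$$

Thus, the general bivariate normal density has the form

$$p_{x_1 x_2}(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
\exp\left[ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2
 - 2\rho \left(\frac{x_1 - \mu_1}{\sigma_1}\right) \left(\frac{x_2 - \mu_2}{\sigma_2}\right)
 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 \right] \right].$$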

As we can see from Fig. A.4, $p(x_1, x_2)$ is a hill-shaped surface over the $x_1 x_2$ plane.
The peak of the hill occurs at the point $(x_1, x_2) = (\mu_1, \mu_2)$, i.e., at the mean vector $\boldsymbol{\mu}$.
The shape of the hump depends on the two variances $\sigma_1^2$ and $\sigma_2^2$, and the correlation
coefficient $\rho$. If we slice the surface with horizontal planes parallel to the $x_1 x_2$ plane,
we obtain the so-called level curves, defined by the locus of points where the quadratic
form

$$\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2
 - 2\rho \left(\frac{x_1 - \mu_1}{\sigma_1}\right) \left(\frac{x_2 - \mu_2}{\sigma_2}\right)
 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 \tag{113}$$

is constant. It is not hard to show that $|\rho| \le 1$, and that this implies that the level
curves are ellipses. The $x_1$ and $x_2$ extents of these ellipses are determined by the variances
$\sigma_1^2$ and $\sigma_2^2$, and their eccentricity is determined by $\rho$. More specifically, the principal
axes of the ellipse are in the direction of the eigenvectors $\mathbf{e}_i$ of $\boldsymbol{\Sigma}$, and the
widths in these directions are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the corresponding
eigenvalues. For instance, if $\rho = 0$, the principal axes of the ellipses
are parallel to the coordinate axes, and the variables are statistically independent. In
the special cases where $\rho = 1$ or $\rho = -1$, the ellipses collapse to straight lines. Indeed,
the joint density becomes singular in this situation, because there is really only one
independent variable. We shall avoid this degeneracy by assuming that $|\rho| < 1$.
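
The quadratic form that is constant along each level curve is, up to the factor 1/(1 − ρ²), the squared Mahalanobis distance of Eq. 107. The sketch below evaluates it in two ways, once through the explicit 2×2 inverse of Eq. 111 and once through the scalar expression of Eq. 112, and confirms that they agree. The function names and the numerical values are hypothetical.

```python
def quadratic_form_matrix(x, mu, sigma1, sigma2, rho):
    """(x - mu)^t Sigma^{-1} (x - mu), using the explicit 2x2 inverse (Eq. 111)."""
    s11, s22, s12 = sigma1 ** 2, sigma2 ** 2, rho * sigma1 * sigma2
    det = s11 * s22 - s12 * s12          # Eq. 110; zero when |rho| = 1
    if det <= 0.0:
        raise ValueError("degenerate covariance matrix: need |rho| < 1")
    inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def quadratic_form_scalar(x, mu, sigma1, sigma2, rho):
    """The same quantity written in standardized coordinates, as in Eq. 112."""
    a = (x[0] - mu[0]) / sigma1
    b = (x[1] - mu[1]) / sigma2
    return (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)

# Hypothetical parameters and test point.
x, mu = (3.0, -1.0), (1.0, 2.0)
r2_matrix = quadratic_form_matrix(x, mu, sigma1=2.0, sigma2=1.5, rho=0.6)
r2_scalar = quadratic_form_scalar(x, mu, sigma1=2.0, sigma2=1.5, rho=0.6)
print(r2_matrix, r2_scalar)
```

For ρ = 0 the expression reduces to the sum of the two squared z-scores, which is why the level curves are then ellipses aligned with the coordinate axes.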


Figure A.4: A two-dimensional Gaussian having mean $\boldsymbol{\mu}$ and non-diagonal covariance
$\boldsymbol{\Sigma}$. If the value on one variable is known, for instance $x_1 = \hat{x}_1$, the distribution over
the other variable is Gaussian with mean $\mu_{2|1}$.


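One of the important properties of the multivariate normal density is that all
conditional and marginal densities are also normal. To find such a density explicitly,
which we denote $p_{x_2|x_1}(x_2|x_1)$, we substitute our formulas for $p_{x_1 x_2}(x_1, x_2)$ and
$p_{x_1}(x_1)$ in the defining equation

$$p_{x_2|x_1}(x_2|x_1) = \frac{p_{x_1 x_2}(x_1, x_2)}{p_{x_1}(x_1)} \tag{114}$$

$$\begin{aligned}
&= \left[ \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}\,
 e^{-\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2
 - 2\rho \left(\frac{x_1 - \mu_1}{\sigma_1}\right) \left(\frac{x_2 - \mu_2}{\sigma_2}\right)
 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 \right]} \right]
 \times \left[ \sqrt{2\pi}\, \sigma_1\, e^{\frac{1}{2} \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2} \right] \\
&= \frac{1}{\sqrt{2\pi}\, \sigma_2 \sqrt{1 - \rho^2}}
 \exp\left[ -\frac{1}{2(1 - \rho^2)} \left[ \frac{x_2 - \mu_2}{\sigma_2} - \rho\, \frac{x_1 - \mu_1}{\sigma_1} \right]^2 \right] \\
&= \frac{1}{\sqrt{2\pi}\, \sigma_2 \sqrt{1 - \rho^2}}
 \exp\left[ -\frac{1}{2} \left( \frac{x_2 - \left[ \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (x_1 - \mu_1) \right]}{\sigma_2 \sqrt{1 - \rho^2}} \right)^2 \right].
\end{aligned} \tag{115}$$

Thus, we have verified that the conditional density $p_{x_2|x_1}(x_2|x_1)$ is a normal
distribution. Moreover, we have explicit formulas for the conditional mean $\mu_{2|1}$ and the
conditional variance $\sigma_{2|1}^2$:

$$\mu_{2|1} = \mu_2 + \rho\, \frac{\sigma_2}{\sigma_1} (x_1 - \mu_1) \quad \text{and} \quad
\sigma_{2|1}^2 = \sigma_2^2 (1 - \rho^2), \tag{116}$$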


as illustrated in Fig. A.4. 

These formulas provide some insight into the question of how knowledge of the
value of $x_1$ helps us to estimate $x_2$. Suppose that we know the value of $x_1$. Then
a natural estimate for $x_2$ is the conditional mean, $\mu_{2|1}$. In general, $\mu_{2|1}$ is a linear
function of $x_1$; if the correlation coefficient $\rho$ is positive, the larger the value of $x_1$,
the larger the value of $\mu_{2|1}$. If it happens that $x_1$ is the mean value $\mu_1$, then the best
we can do is to guess that $x_2$ is equal to $\mu_2$. Also, if there is no correlation between
$x_1$ and $x_2$, we ignore the value of $x_1$, whatever it is, and we always estimate $x_2$ by
$\mu_2$. Note that in that case the variance of $x_2$, given that we know $x_1$, is the same
as the variance for the marginal distribution, i.e., $\sigma_{2|1}^2 = \sigma_2^2$. If there is correlation,
knowledge of the value of $x_1$, whatever the value is, reduces the variance. Indeed,
with 100% correlation there is no variance left in $x_2$ when the value of $x_1$ is known.
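
The behavior described above can be read off directly from Eq. 116. The following sketch evaluates the conditional mean and variance for a few values of the correlation coefficient; the function name and the parameter values are hypothetical.

```python
def conditional_params(x1, mu1, mu2, sigma1, sigma2, rho):
    """Conditional mean and variance of x2 given x1 for a bivariate normal (Eq. 116)."""
    mu_2_given_1 = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    var_2_given_1 = sigma2 ** 2 * (1.0 - rho ** 2)
    return mu_2_given_1, var_2_given_1

# Hypothetical parameters: mu1 = 0, mu2 = 1, sigma1 = 2, sigma2 = 3, observed x1 = 2.
for rho in (0.0, 0.8, 1.0):
    mean, var = conditional_params(2.0, 0.0, 1.0, 2.0, 3.0, rho)
    print(f"rho = {rho:.1f}: conditional mean = {mean:.2f}, conditional variance = {var:.2f}")
```

With ρ = 0 the estimate stays at the marginal mean 1 with the marginal variance 9; with ρ = 0.8 the estimate moves to 3.4 and the variance drops to 3.24; with ρ = 1 no variance is left.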


A.6 Hypothesis testing 


Suppose samples are drawn either from distribution $D_0$ or they are not. In pattern
classification, we seek to determine which distribution was the source of any sample,
and if it is indeed $D_0$, we would classify the point accordingly, into $\omega_1$, say. Hypothesis
testing addresses a somewhat different but related problem. We assume initially that
distribution $D_0$ is the source of the patterns; this is called the null hypothesis, and
often denoted $H_0$. Based on the value of any observed sample we ask whether we can
reject the null hypothesis, that is, state with some degree of confidence (expressed as
a probability) that the sample did not come from $D_0$.

For instance, $D_0$ might be a standardized Gaussian, $p(x) \sim N(0, 1)$, and our null
hypothesis is that a sample comes from a Gaussian with mean $\mu = 0$. If the magnitude of
a particular sample is small (e.g., x = 0.3), it is likely that it came from $D_0$; after
all, 68% of the samples drawn from that distribution have absolute value less than
x = 1.0 (cf. Fig. A.1). If a sample’s magnitude is large (e.g., x = 5), then we would be
more confident that it did not come from $D_0$. In such a situation we merely conclude
that (with some probability) the sample was drawn from a distribution with $\mu \ne 0$.

Viewed another way, for any confidence (expressed as a probability) there
exists a criterion value such that if the sampled value differs from $\mu = 0$ by more
than that criterion, we reject the null hypothesis. (It is traditional to use confidences
of .01 or .05.) We then say that the difference of the sample from 0 is statistically
significant. For instance, if our null hypothesis is a standardized Gaussian, then if
our sample differs from the value x = 0 by more than 2.576, we could reject the null
hypothesis “at the .01 confidence level,” as can be deduced from Table A.1. A more
sophisticated analysis could be applied if several samples are all drawn from $D_0$ or
if the null hypothesis involved a distribution other than a Gaussian. Of course, this
usage of “significance” applies only to the statistical properties of the problem; it
implies nothing about whether the results are “important.” Hypothesis testing is of
great generality, and useful when we seek to know whether something other than the
assumed case (the null hypothesis) is the case.
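
For the standardized Gaussian null hypothesis, the test described above amounts to comparing a two-sided tail probability with the chosen level. The sketch below uses `math.erfc` from the Python standard library; the function names and the default level of .01 are choices made for this illustration.

```python
import math

def two_sided_tail(x, mu=0.0, sigma=1.0):
    """Probability that a sample from N(mu, sigma^2) lies at least as far from mu as x does."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))   # equals 1 - Pr[|u| < z]

def reject_null(x, level=0.01):
    """Reject the null hypothesis N(0, 1) when the tail probability falls below the level."""
    return two_sided_tail(x) < level

for x in (0.3, 2.5, 3.0, 5.0):
    print(f"x = {x}: tail probability = {two_sided_tail(x):.2e}, reject = {reject_null(x)}")
```

A sample at x = 0.3 gives a tail probability of about 0.76 and no rejection, whereas x = 5 gives a tail probability below one in a million; the boundary between the two outcomes at the .01 level lies at the criterion 2.576 read from Table A.1.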


A.6.1 Chi-squared test 


Hypothesis testing can be applied to discrete problems too. Suppose we have n
patterns, $n_1$ of which are known to be in $\omega_1$ and $n_2$ in $\omega_2$, and we are interested
in determining whether a particular decision rule is useful or informative. In this case,
the null hypothesis is a random decision rule: one that selects a pattern and with
some probability P places it in a category which we will call the “left” category, and
otherwise in the “right” category. We say that a candidate rule is informative if it
differs significantly from such a random decision.

What we need is a clear mathematical definition of statistical significance under
these conditions. The random rule (the null hypothesis) would place $P n_1$ patterns
from $\omega_1$ and $P n_2$ from $\omega_2$ independently in the left category and the remainder in
the right category. Our candidate decision rule would differ significantly from the
random rule if the proportions differed significantly from those given by the random
rule. Formally, we let $n_{iL}$ denote the number of patterns from category $\omega_i$ placed in
the left category by our candidate rule. The so-called chi-squared statistic for this
case is

$$\chi^2 = \sum_{i=1}^{2} \frac{(n_{iL} - n_{ie})^2}{n_{ie}}, \tag{117}$$

where according to the null hypothesis, the number of patterns in category $\omega_i$ that we
expect to be placed in the left category is $n_{ie} = P n_i$. Clearly $\chi^2$ is non-negative, and
is zero if and only if all the observed numbers match the expected numbers. The higher the $\chi^2$,
the less likely it is that the null hypothesis is true. Thus, for a sufficiently high $\chi^2$, the
difference between the expected and observed distributions is statistically significant,
we can reject the null hypothesis, and can consider our candidate decision rule to be
“informative.” For any desired level of significance, such as .01 or .05, a table
gives the critical values of $\chi^2$ that allow us to reject the null hypothesis (Table A.2).

There is one detail that must be addressed: the number of degrees of freedom.
In the situation described above, once the probability $P$ is known, there is only one
free variable needed to describe a candidate rule. For instance, once the number of
patterns from $\omega_1$ placed in the left category is known, all other values are determined
uniquely. Hence in this case the number of degrees of freedom is 1. If there were more
categories, or if the candidate decision rule had more possible outcomes, then df would
be greater than 1. The higher the number of degrees of freedom, the higher the
computed $\chi^2$ must be to meet a desired level of significance.

We denote the critical values as, for instance, $\chi^2_{.01(1)} = 6.64$, where the subscript
denotes the significance, here .01, and the integer in parentheses is the degrees of
freedom. (In the Table, we conform to the usage in statistics, where this positive
integer is denoted df, despite the possible confusion in calculus where it denotes an
infinitesimal real number.) Thus if we have one degree of freedom, and the observed
$\chi^2$ is greater than 6.64, then we can reject the null hypothesis, and say that, at the
.01 confidence level, our results did not come from a (weighted) random decision.
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Table A.2: Critical values of chi-square (at two confidence levels) for different degrees 
of freedom (df) 


df |   .05    .01  || df |   .05    .01  || df |   .05    .01
 1 |  3.84   6.64  || 11 | 19.68  24.72  || 21 | 32.67  38.93
 2 |  5.99   9.21  || 12 | 21.03  26.22  || 22 | 33.92  40.29
 3 |  7.82  11.34  || 13 | 22.36  27.69  || 23 | 35.17  41.64
 4 |  9.49  13.28  || 14 | 23.68  29.14  || 24 | 36.42  42.98
 5 | 11.07  15.09  || 15 | 25.00  30.58  || 25 | 37.65  44.31
 6 | 12.59  16.81  || 16 | 26.30  32.00  || 26 | 38.88  45.64
 7 | 14.07  18.48  || 17 | 27.59  33.41  || 27 | 40.11  46.96
 8 | 15.51  20.09  || 18 | 28.87  34.80  || 28 | 41.34  48.28
 9 | 16.92  21.67  || 19 | 30.14  36.19  || 29 | 42.56  49.59
10 | 18.31  23.21  || 20 | 31.41  37.57  || 30 | 43.77  50.89
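
As a rough illustration of the chi-squared test for the two-category case, the following Python sketch computes the statistic of Eq. 117 and compares it with the df = 1 critical values from Table A.2. The pattern counts, the value of P, and the names `chi_squared` and `CRITICAL_DF1` are hypothetical choices made only for this example.

```python
def chi_squared(n_left, n_total, P):
    """Chi-squared statistic of Eq. 117.

    n_left[i]  : number of patterns from category i placed "left" by the candidate rule
    n_total[i] : number of patterns known to be in category i
    P          : probability that the random (null) rule places a pattern "left"
    """
    stat = 0.0
    for n_iL, n_i in zip(n_left, n_total):
        n_ie = P * n_i                      # expected count under the null hypothesis
        stat += (n_iL - n_ie) ** 2 / n_ie
    return stat

# Critical values for one degree of freedom, copied from Table A.2.
CRITICAL_DF1 = {0.05: 3.84, 0.01: 6.64}

# Hypothetical data: 50 patterns per category, candidate rule sends 40 and 10 "left".
n_total = [50, 50]
n_left = [40, 10]
P = 0.5

chi2 = chi_squared(n_left, n_total, P)
reject_05 = chi2 > CRITICAL_DF1[0.05]
reject_01 = chi2 > CRITICAL_DF1[0.01]
print(f"chi^2 = {chi2:.2f}, reject at .05: {reject_05}, reject at .01: {reject_01}")
```

With these made-up counts each category contributes (15)^2 / 25 = 9 to the statistic, so the candidate rule would be judged informative at both levels.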


A.7 Information theory 


A.7.1 Entropy and information 


Assume we have a discrete set of symbols $\{v_1, v_2, \ldots, v_m\}$ with associated probabilities
$P_i$. The entropy of the discrete distribution, a measure of the randomness or
unpredictability of a sequence of symbols drawn from it, is


$$H = -\sum_{i=1}^{m} P_i \log_2 P_i, \tag{118}$$


where since we use the logarithm base 2 entropy is measured in bits. In case any
of the probabilities vanish, we use the relation $0 \log 0 = 0$. One bit corresponds
to the uncertainty that can be resolved by the answer to a single yes/no question.
(For continuous distributions, we often use logarithm base $e$, denoted $\ln$, in which
case the unit is the nat.) The expectation operator (cf. Eq. 41) can be used to write
$H = \mathcal{E}[\log_2 1/P]$, where we think of $P$ as being a random variable whose possible
values are $P_1, P_2, \ldots, P_m$. The term $\log_2 1/P$ is sometimes called the surprise: if
$P_i = 0$ except for one $i$, then there is no surprise when the corresponding symbol
occurs.

Note that the entropy does not depend on the symbols themselves, just on their
probabilities. For a given number of symbols $m$, the uniform distribution, in which
each symbol is equally likely, is the maximum entropy distribution (and $H = \log_2 m$
bits): we have the maximum uncertainty about the identity of each symbol that
will be chosen. Clearly if $x$ is equally likely to take on integer values $0, 1, \ldots, 7$, we
need 3 bits to describe the outcome and $H = \log_2 2^3 = 3$. Conversely, if all the $P_i$
are 0 except one, we have the minimum entropy distribution ($H = 0$ bits): we are
certain as to the symbol that will appear.
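
The discrete entropy and its two extreme cases can be checked with a few lines of Python. This is only a minimal sketch; the function name `entropy_bits` and the example distributions are assumptions introduced for illustration.

```python
import math

def entropy_bits(probs):
    """Discrete entropy in bits (Eq. 118), using the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform8 = [1 / 8] * 8                # eight equally likely symbols
certain = [0.0, 1.0, 0.0, 0.0]        # one symbol has probability 1
skewed = [0.5, 0.25, 0.125, 0.125]    # hypothetical non-uniform distribution

h_uniform = entropy_bits(uniform8)    # maximum entropy for m = 8: log2(8) bits
h_certain = entropy_bits(certain)     # minimum entropy
h_skewed = entropy_bits(skewed)       # somewhere in between

print(h_uniform, h_certain, h_skewed)
```

The skewed distribution has four symbols, so its entropy must lie below the two-bit maximum for m = 4.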

For a continuous distribution, the entropy is 


$$H = -\int_{-\infty}^{\infty} p(x) \ln p(x)\,dx, \tag{119}$$
and again $H = \mathcal{E}[\ln 1/p]$. It is worth mentioning that among all continuous density
functions having a given mean $\mu$ and variance $\sigma^2$, it is the Gaussian that has the
maximum entropy ($H = .5 + \ln(\sqrt{2\pi}\,\sigma)$ nats, or equivalently $\frac{1}{2}\log_2(2\pi e \sigma^2)$ bits).
We can let $\sigma$ approach zero to find that a probability density in the form of a Dirac
delta function, i.e.,


$$\delta(x - a) = \begin{cases} 0 & \text{if } x \neq a \\ \infty & \text{if } x = a, \end{cases} \qquad \text{with} \qquad \int_{-\infty}^{\infty} \delta(x)\,dx = 1, \tag{120}$$


has the minimum entropy ($H = -\infty$ bits). For a Dirac function, we are sure that the
value $a$ will be selected each time.
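
The behavior of the Gaussian entropy as the width shrinks can be seen numerically. The sketch below approximates the integral of Eq. 119 by a midpoint rule and compares it with the standard closed form for a Gaussian, (1/2) ln(2 pi e sigma^2) nats. The integration range of ten standard deviations, the step count, and all function names are assumptions chosen for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def differential_entropy_nats(pdf, lo, hi, steps=20000):
    """Midpoint-rule approximation of -integral p(x) ln p(x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        p = pdf(lo + (k + 0.5) * dx)
        if p > 0:                      # skip points where the density underflows to 0
            total -= p * math.log(p) * dx
    return total

def gaussian_entropy_nats(sigma):
    """Closed-form differential entropy of a Gaussian, in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

results = []
for sigma in (1.0, 0.1, 0.001):
    numeric = differential_entropy_nats(
        lambda x: gaussian_pdf(x, 0.0, sigma), -10 * sigma, 10 * sigma)
    results.append((sigma, numeric, gaussian_entropy_nats(sigma)))
    print(f"sigma = {sigma}: numeric = {numeric:.4f}, closed form = {results[-1][2]:.4f}")
```

As sigma decreases the entropy becomes negative and keeps falling, consistent with the limit of minus infinity for a Dirac delta function.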

Our use of entropy in continuous functions, such as in Eq. 119, belies some subtle
issues which are worth pointing out. If $x$ had units, such as meters, then the
probability density $p(x)$ would have to have units of $1/x$. There would be something
fundamentally wrong in taking the logarithm of $p(x)$, since the argument of the
logarithm function should be dimensionless. What we should really be dealing with is a
dimensionless quantity, say $p(x)/p_0(x)$, where $p_0(x)$ is some reference density function
(cf., Sect. A.7.2).

For discrete variable $x$ and arbitrary function $f(\cdot)$, we have $H(f(x)) \le H(x)$, i.e.,
processing never increases entropy. For instance, if $f(x) = \text{const}$, the entropy will vanish.
Another key property of the entropy of a discrete distribution is that it is invariant to
“shuffling” the event labels. The related question with continuous variables concerns
what happens when one makes a change of variables. In general, if we make a change of
variables, such as $y = x^3$ or even $y = 10x$, we will get a different value for the integral
$\int q(y) \log q(y)\,dy$, where $q$ is the induced density for $y$. If entropy is supposed
to measure the intrinsic disorganization, it doesn’t make sense that $y$ would have a
different amount of intrinsic disorganization than $x$, since one is always derivable from
the other; only if there were some randomness (e.g., shuffling) incorporated into the
mapping could we say that one is more disorganized than the other.
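
Both observations can be illustrated with a small Python sketch. The discrete part pushes a distribution through several functions and compares entropies; the continuous part uses the standard closed form for the entropy of a Gaussian, (1/2) ln(2 pi e sigma^2) nats, to show the shift caused by the rescaling y = 10x. The helper names and the example distribution are hypothetical.

```python
import math
from collections import defaultdict

def entropy_bits(probs):
    """Discrete entropy in bits, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pushforward(dist, f):
    """Distribution of f(x) when x is drawn from dist (dict: value -> probability)."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)

# Hypothetical discrete variable: four equally likely values.
dist_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

h_x = entropy_bits(dist_x.values())
h_parity = entropy_bits(pushforward(dist_x, lambda x: x % 2).values())          # merges values
h_const = entropy_bits(pushforward(dist_x, lambda x: 0).values())               # f(x) = const
h_shuffled = entropy_bits(pushforward(dist_x, lambda x: (x + 1) % 4).values())  # relabeling

# Continuous case: if x is Gaussian with standard deviation sigma,
# then y = 10x is Gaussian with standard deviation 10 * sigma.
def gaussian_entropy_nats(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

sigma = 1.0
shift = gaussian_entropy_nats(10 * sigma) - gaussian_entropy_nats(sigma)

print(h_x, h_parity, h_const, h_shuffled, shift)
```

The deterministic, invertible rescaling changes the continuous entropy by ln 10 nats even though no randomness was added, which is the difficulty described above.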

Fortunately, in practice these concerns do not present important stumbling blocks 
since relative entropy and differences in entropy are more fundamental than H taken 
by itself. Nevertheless, questions of the foundations of entropy measures for continu- 
ous variables are addressed in books listed in Bibliographical Remarks. 


A.7.2 Relative entropy 


Suppose we have two discrete distributions over the same variable x, p(x) and q(x). 
The relative entropy or Kullback-Leibler distance (which is closely related to cross 
entropy, information divergence and information for discrimination) is a measure of 
the “distance” between these distributions: 


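$$D_{KL}(p(x), q(x)) = \sum_{x} q(x) \ln \frac{q(x)}{p(x)}. \tag{121}$$


The continuous version is

$$D_{KL}(p(x), q(x)) = \int_{-\infty}^{\infty} q(x) \ln \frac{q(x)}{p(x)}\,dx. \tag{122}$$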
Although $D_{KL}(p(\cdot), q(\cdot)) \ge 0$ and $D_{KL}(p(\cdot), q(\cdot)) = 0$ if and only if $p(\cdot) = q(\cdot)$, the
relative entropy is not a true metric, since $D_{KL}$ is not necessarily symmetric in the
interchange $p \leftrightarrow q$ and furthermore the triangle inequality need not be satisfied.
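
The lack of symmetry is easy to see numerically. The following sketch assumes the discrete relative entropy is taken as the sum over x of q(x) ln(q(x)/p(x)), in nats; the function name `kl_distance` and the two example distributions are hypothetical.

```python
import math

def kl_distance(p, q):
    """Relative entropy in nats, taken here as sum_x q(x) ln(q(x) / p(x)).

    Terms with q(x) = 0 contribute nothing; if q(x) > 0 where p(x) = 0
    the distance is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if qx > 0:
            if px == 0:
                return math.inf
            total += qx * math.log(qx / px)
    return total

# Two hypothetical distributions over the same binary variable.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_distance(p, q)
d_qp = kl_distance(q, p)
d_pp = kl_distance(p, p)

print(f"D(p, q) = {d_pq:.4f}, D(q, p) = {d_qp:.4f}, D(p, p) = {d_pp:.4f}")
```

Both directions give a non-negative value and the distance from a distribution to itself is zero, but the two directions disagree, so the quantity cannot be a metric.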


A.7.3 Mutual information 


Now suppose we have two distributions over possibly different variables, e.g., p(x) and 
q(y). The mutual information is the reduction in uncertainty about one variable due 
to the knowledge of the other variable 


$$I(p; q) = H(p) - H(p|q) = \sum_{x,y} r(x, y) \log_2 \frac{r(x, y)}{p(x)\,q(y)}, \tag{123}$$

where r(x, y) is the joint distribution of finding value x and y. Mutual information 
is simply the relative entropy between the joint distribution r(x, y) and the product 
distribution p(x)q(y) and as such it measures how much the distributions of the vari- 
ables differ from statistical independence. Mutual information does not obey all the 
properties of a metric. In particular, the metric requirement that if p(x) = q(y) then 
I(x; y) = 0 need not hold, in general. As an example, suppose we have two binary 
random variables with r(0,0) = r(1,1) = 1/2, so r(0,1) = r(1,0) = 0. According to 
Eq. 123, the mutual information between $p(x)$ and $q(y)$ is $\log_2 2 = 1$ bit.
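
The binary example can be reproduced with a short Python sketch that computes the mutual information directly from a joint distribution, taking the marginals p(x) and q(y) as sums over the joint. The function name and the dictionary representation of r(x, y) are assumptions made for this illustration.

```python
import math

def mutual_information_bits(r):
    """Mutual information in bits from a joint distribution r[(x, y)]."""
    p, q = {}, {}
    for (x, y), v in r.items():        # marginals p(x) and q(y)
        p[x] = p.get(x, 0.0) + v
        q[y] = q.get(y, 0.0) + v
    return sum(v * math.log2(v / (p[x] * q[y]))
               for (x, y), v in r.items() if v > 0)

# The example from the text: the two binary variables always agree.
r_identical = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# For contrast, a hypothetical joint with statistically independent variables.
r_independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

i_identical = mutual_information_bits(r_identical)
i_independent = mutual_information_bits(r_independent)

print(i_identical, i_independent)
```

In both cases the marginals are identical, p(x) = q(y) = (1/2, 1/2), yet the mutual information differs, since it depends on the joint distribution rather than on the marginals alone.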

The relationships among the entropy, relative entropy and mutual information are
summarized in Fig. A.5. The figure shows, for instance, that the joint entropy $H(p, q)$
is never smaller than the individual entropies $H(p)$ and $H(q)$; that $H(p) = H(p|q) +
I(p; q)$, and so on.


Figure A.5: The mathematical relationships among the entropy of distributions $p$ and
$q$, mutual information $I(p; q)$, and conditional entropies $H(p|q)$ and $H(q|p)$. From this
figure one can quickly see relationships among the information functions. For instance
we can see immediately that $I(p; p) = H(p)$; that if $I(p; q) = 0$ then $H(q|p) = H(q)$;
that $H(p, q) = H(p|q) + H(q)$, and so forth.


A.8 Computational complexity 


In order to analyze and describe the difficulty of problems and the algorithms
designed to solve such problems, we turn now to the technical notion of computational
complexity. For instance, calculating the covariance matrix for a set of samples is somehow
“harder” than calculating the mean. Furthermore, one algorithm for computing
some function may be faster or take less memory than another algorithm. We seek
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to specify such differences, independent of the current computer hardware (which is 
always changing anyway). 

To this end we use the concept of the order of a function and the asymptotic 
notations “big oh,” “big omega,” and “big theta.” The three asymptotic bounds 
most often used are: 


Asymptotic upper bound $O(g(x)) = \{f(x)$: there exist positive constants $c$ and
$x_0$ such that $0 \le f(x) \le c\,g(x)$ for all $x \ge x_0\}$


Asymptotic lower bound $\Omega(g(x)) = \{f(x)$: there exist positive constants $c$ and
$x_0$ such that $0 \le c\,g(x) \le f(x)$ for all $x \ge x_0\}$


Asymptotically tight bound $\Theta(g(x)) = \{f(x)$: there exist positive constants $c_1$, $c_2$,
and $x_0$ such that $0 \le c_1 g(x) \le f(x) \le c_2 g(x)$ for all $x \ge x_0\}$


Figure A.6: Three types of asymptotic bounds: a) $f(x) = O(g(x))$. b) $f(x) =
\Omega(g(x))$. c) $f(x) = \Theta(g(x))$.


Consider the asymptotic upper bound. We say that $f(x)$ is “of order big oh of $g(x)$”
(written $f(x) = O(g(x))$) if there exist constants $c_0$ and $x_0$ such that $f(x) \le c_0 g(x)$
for all $x > x_0$. We shall assume that all our functions are positive and dispense
with taking absolute values. This means simply that for sufficiently large $x$, an upper
bound on $f(x)$ grows no worse than $g(x)$. For instance, if $f(x) = a + bx + cx^2$ then
$f(x) = O(x^2)$ because for sufficiently large $x$, the constant, linear and quadratic terms
can be “overcome” by proper choice of $c_0$ and $x_0$. The generalization to functions
of two or more variables is straightforward. It should be clear that by the definition
above, the (big oh) order of a function is not unique. For instance, we can describe
our particular $f(x)$ as being $O(x^2)$, $O(x^3)$, $O(x^4)$, $O(x^2 \ln x)$, and so forth. We use
big omega notation, $\Omega(\cdot)$, for lower bounds, and little omega, $\omega(\cdot)$, for lower
bounds that are not asymptotically tight. Of these, the big oh notation has proven to be
most useful since we generally want an upper bound on the resources when solving a problem.
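
The role of the constants in the big oh definition can be made concrete with a small numerical check. In the Python sketch below the coefficients a = 5, b = 3, c = 2 and the choices c0 = a + b + c and x0 = 1 are hypothetical values picked for illustration; a finite scan is of course evidence for the bound, not a proof of it.

```python
def f(x, a=5.0, b=3.0, c=2.0):
    """Quadratic f(x) = a + b x + c x^2 with hypothetical positive coefficients."""
    return a + b * x + c * x ** 2

# For x >= 1 we have a <= a x^2 and b x <= b x^2, so f(x) <= (a + b + c) x^2.
c0, x0 = 10.0, 1

xs = range(x0, 10001)
bound_holds = all(f(x) <= c0 * x ** 2 for x in xs)

# A constant that is too small never works: c0 = 2 equals the leading
# coefficient, and the lower-order terms keep f(x) strictly above 2 x^2.
too_small_holds = any(f(x) <= 2.0 * x ** 2 for x in xs)

# The same f(x) is also O(x^3), so the big oh order is not unique.
cubic_bound_holds = all(f(x) <= c0 * x ** 3 for x in xs)

# f(x) / x^2 approaches the leading coefficient c = 2 for large x.
ratio_at_large_x = f(10000) / 10000 ** 2

print(bound_holds, too_small_holds, cubic_bound_holds, ratio_at_large_x)
```

Any c0 larger than the leading coefficient would also work with a suitably larger x0, which is why the analysis by itself says nothing about the constants.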

The lower bound on the complexity of the problem is denoted $\Omega(g(x))$, and is therefore
the lower bound on any algorithm that solves that problem. Similarly,
if the complexity of an algorithm is O(g(x)), it is an upper bound on the complexity 
of the problem it solves. The complexity of some problems — such as computing the 
mean of a discrete set — is known, and thus once we have found an algorithm having 
equal complexity, the only possible improvement could be on lowering the constants 
of proportionality. The complexity of other problems — such as inverting a matrix 
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— is not yet known, and if fundamental analysis cannot derive it, we must rely on 
algorithm developers who find algorithms whose complexity 

Approximately. 

Such a rough analysis does not tell us the constants $c$ and $x_0$. For a finite size
problem it is possible that a particular $O(x^3)$ algorithm is simpler than a particular
$O(x^2)$ algorithm, and it is occasionally necessary for us to determine these constants
to find which of several implementations is the simplest. Nevertheless, for our
purposes the big oh notation as just described is generally the best way to describe
the computational complexity of an algorithm.

Suppose we have a set of $n$ vectors, each of which is $d$-dimensional, and we want to
calculate the mean vector. Clearly, this requires $O(nd)$ arithmetic operations. Sometimes we
stress space and time complexities, which are particularly relevant when contemplating
parallel hardware implementations. For instance, the $d$-dimensional sample mean
could be calculated with $d$ separate processors, each adding $n$ sample values. Thus
we can describe this implementation as $O(d)$ in space (i.e., the amount of memory
or possibly the number of processors) and $O(n)$ in time (i.e., number of sequential
steps). Of course for any particular algorithm there may be a number of time-space
tradeoffs.
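
The operation count for the sample mean can be made explicit with a short serial sketch in Python. The function name `mean_vector`, the explicit addition counter, and the sample data are hypothetical and serve only to illustrate the counting argument.

```python
def mean_vector(samples):
    """Sample mean of n d-dimensional vectors, counting the additions performed."""
    n = len(samples)
    d = len(samples[0])
    total = [0.0] * d
    additions = 0
    for x in samples:                  # n vectors ...
        for j in range(d):             # ... each with d components
            total[j] += x[j]
            additions += 1
    mean = [t / n for t in total]      # d further divisions
    return mean, additions

# Hypothetical data: n = 4 vectors in d = 3 dimensions.
samples = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]

mean, additions = mean_vector(samples)
print(mean, additions)
```

The inner loop over components is what the d separate processors of the parallel implementation would each handle independently, leaving n sequential additions per processor.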


Bibliographical Remarks 


There are several good books on linear systems, such as [14], and matrix computations
[8]. Lagrange optimization and related techniques are covered in the definitive book
[2]. While [13] and [3] are of foundational and historic interest, readers seeking clear
presentations of the central ideas in probability should consult [10, 7, 6, 21]. A
handy reference to terms in probability and statistics is [20]. A number of books cover
hypothesis testing and statistical significance, some elementary, such as [24], and some
more advanced, such as [18, 25]. Shannon’s foundational paper [22] should be read by all
students of pattern recognition. It, and many other historically important papers on
information theory, can be found in [23]. An excellent textbook at the level of this one
is [5] and readers seeking a more abstract and formal treatment should consult [9]. The
study of time complexity of algorithms began with [12], and of space complexity with
[11, 19]. The multi-volume [15, 16, 17] contains a description of computational
complexity, the big oh and other asymptotic notations. Somewhat more accessible
treatments can be found in [4] and [1].
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From ther rocas Of the Fired Faltis 


“The first edition of this book, published 30 years ago by Duda and Hart, has been
a defining book for the field of Pattern Recognition. Stork has done a superb job
of updating the book. He has undertaken a monumental task of sifting through 30
years of material in a rapidly growing field and presented another snapshot of the
field, determining what will be of importance for the next 30 years and incorporating
it into this second edition. The style is easy to read as in the original book.
The end result is harmonious, leading the reader through many new topics...”
Sargur N. Srihari, PhD
Director, Center for Excellence in Document Analysis and Recognition, Distinguished Professor,
Department of Computer Science and Engineering, SUNY at Buffalo


Practitioners developing or investigating pattern recognition systems in such diverse
application areas as speech recognition, optical character recognition, image
processing, or signal analysis, often face the difficult task of having to decide among a
bewildering array of available techniques. This unique text/professional reference
provides the information you need to choose the most appropriate method for a
given class of problems, presenting an in-depth, systematic account of the major
topics in pattern recognition today. A new edition of a classic work that helped define
the field for over a quarter century, this practical book updates and expands the
original work, focusing on pattern classification and the immense progress it has
experienced in recent years. Special features include:


* Clear explanations of both classical and new methods, including neural networks,
stochastic methods, genetic algorithms, and theory of learning

* Over 150 high-quality, two-color illustrations highlighting various concepts

* Numerous worked examples

* Pseudocode for pattern recognition algorithms

* Expanded problems, keyed specifically to the text

* Complete exercises, linked to the text

* Algorithms to explain specific pattern-recognition and learning techniques

* Historical remarks and important references at the end of chapters

* Appendices covering the necessary mathematical background
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